In this segment we review hardwired, programmable, and configurable multiply implementations. The custom multiplier implementations show us the functional density achievable by custom hardware on its intended task for comparison with the general-purpose structures reviewed in Chapter .
We use the multiply operation for this comparison because it is relatively simple and important to many computing tasks, including signal processing. Because of its importance and regularity, it has received much attention over the years, including many high-quality custom implementations. Multiply is probably one of the first computational operators to be implemented in most new VLSI processes. Given the amount of attention paid to custom multiply implementations, the comparison between custom multipliers and configurable implementations should represent an upper bound on the performance disparity between custom and configurable implementations. Few functions, if any, should show a larger disparity, and most should show a significantly smaller one. Multiply is also interesting because it is the first piece of custom logic added to ``general-purpose'' processors.
In this section we use a domain-specific metric for functional capacity, the multiply bit operation (). To allow us to compare multiplies of various sizes, we assume each multiply requires . As such, we measure multiply functional density in and compute it as shown in Equation .
An $n \times n$ multiply can be done in fewer than $n^2$ operations (see, for example, [Knu81]), but all of the circuits and algorithms for the multiplies reviewed here do scale as $n^2$.
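To make the counting convention concrete, the following is a minimal sketch of a schoolbook (array-style) multiply that tallies one bit operation for every partial-product cell. It assumes that a "multiply bit operation" corresponds to one such cell; the function name `schoolbook_multiply` and the tally are illustrative and do not come from the implementations reviewed here.

```python
def schoolbook_multiply(a: int, b: int, n: int, m: int) -> tuple[int, int]:
    """Multiply an n-bit value a by an m-bit value b.

    Returns (product, bit_ops), where bit_ops counts one operation per
    partial-product cell (an assumed cost model, for illustration only).
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << m)):
        raise ValueError("operands must fit in n and m bits")
    product = 0
    bit_ops = 0
    for j in range(m):                    # one partial-product row per bit of b
        b_bit = (b >> j) & 1
        row = 0
        for i in range(n):                # one AND cell per bit of a
            row |= (((a >> i) & 1) & b_bit) << i
            bit_ops += 1
        product += row << j               # accumulate the shifted row
    return product, bit_ops
```

Under this model the count depends only on the operand widths, never on the operand values, so an n-bit by m-bit multiply always tallies n·m cells.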
Table summarizes the performance of numerous custom multipliers according to Equation . Implementations range from under 1000 to almost 9000, with 2000-4000 representing the range of typical high-performance custom multipliers. As with processors, there is no clear trend toward improvement with time or decreasing feature size. If anything, the latest designs tend to emphasize latency over throughput, resulting in lower functional density.
Table shows a few sample semicustom multiplier implementations. At 330 and 560, the gate-array and standard-cell implementations provide a factor of 5-10 less functional density than the custom implementations.
For comparison, Table summarizes the capacity density of several configurable and programmable implementations. Processors without specialized multiply support show a factor of 10,000 lower performance density than hardwired multipliers. Processors with multiply-step or Booth-step operations are only a factor of 1,000 lower. FPGAs are a factor of 100-300 less dense than custom hardware. Processors, DSPs, and reconfigurable ALUs with integrated multipliers are only a factor of 10-20 lower in performance density. Figure shows these basic relationships.
One thing we note from Table is that processors with integrated multipliers provide roughly 10% of the performance density of a custom multiplier. This comes about simply by dedicating 10% of the processor real estate to a custom multiplier. Because of the importance of the multiply function in many applications and the 100-1,000× performance density differential achievable by setting aside this 10% (relative to processors that build multiplies from multiply-step or plain ALU operations), many processors and all DSPs augment the general-purpose core with a hardwired multiplier. For this reason, custom multiply and floating-point logic are the two main pieces of custom logic that have been regularly integrated onto conventional ``general-purpose'' computing devices.
A custom multiplier is often called upon to perform multiplies for a variety of data sizes. When multiplying operands smaller than the native multiply size, the custom multiplier yields lower multiply functional density than indicated in Table . Table compares the yielded capacity of the various custom and programmable multipliers reviewed above.
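The loss on small operands can be estimated with a short calculation. The sketch below assumes that a w-bit by w-bit multiply counts as w·w bit operations and that the hardwired multiplier spends the same area and time regardless of operand width; both are simplifying assumptions, and the function name `yielded_fraction` is hypothetical.

```python
def yielded_fraction(native_bits: int, operand_bits: int) -> float:
    """Fraction of a native multiplier's peak density yielded on narrower operands.

    Assumes a w x w multiply is worth w*w bit operations and that the
    multiplier's area and cycle time do not change with operand width.
    """
    if native_bits < 1 or operand_bits < 1:
        raise ValueError("bit widths must be positive")
    if operand_bits > native_bits:
        raise ValueError("operand does not fit the native multiplier")
    return (operand_bits * operand_bits) / (native_bits * native_bits)
```

For example, `yielded_fraction(16, 8)` evaluates to 0.25: under these assumptions, an 8-bit multiply on a 16-bit array yields a quarter of the array's peak density.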
In many applications, one of the operands in the multiply is a constant or changes slowly. In these cases, the operation complexity is slightly reduced in general, and may be greatly reduced in particular circumstances. Hardwired 2-operand multipliers cannot take advantage of this reduced complexity, whereas programmable and configurable devices can. Table summarizes the multiply capacity provided on specialized multiplies. For comparison with the previous tables, the multiply capacity density is calculated as if the device were performing a full multiply. It might be more accurate to say that the complexity of the problem decreased rather than that the density of multiply bit ops increased, but the ratio of the performance density numbers is the same whichever way we view it. Note that the densities shown in Table apply for any constant operand. Particular operands may admit much tighter implementations.
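One way to see why a constant operand reduces the work is to expand the constant into shifts and adds. The sketch below is an illustration under that assumption, not the method used by any device reviewed here; the helper names `constant_multiply_plan` and `multiply_by_constant` are hypothetical.

```python
def constant_multiply_plan(c: int) -> list[int]:
    """Return the shift amounts for the set bits of a non-negative constant c."""
    if c < 0:
        raise ValueError("this sketch handles non-negative constants only")
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(x: int, c: int) -> tuple[int, int]:
    """Compute x * c with shifts and adds; return (product, number_of_adds)."""
    shifts = constant_multiply_plan(c)
    total = 0
    for s in shifts:
        total += x << s                   # one shifted copy of x per set bit of c
    adds = max(len(shifts) - 1, 0)        # the first term needs no adder
    return total, adds
```

A sparse constant such as 10 (binary 1010) needs a single add, while a dense one such as 255 needs seven in this plain binary form. Signed-digit recoding (for instance 255 = 256 - 1) can shrink dense constants further, which is one sense in which particular operands admit tighter implementations.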
In general, reconfigurable devices achieve 100-300× lower capacity density than their custom multiply counterparts. At the same time, they achieve 10-30× better performance density than a processor building a multiply out of ALU operations. For this particular operation, most processors either include a specialized multiply-step operation, which brings them closer to parity with the reconfigurable devices, or integrate a custom multiplier, which gives them a 10× advantage over the reconfigurable devices. Reconfigurable devices that also include custom multiply support achieve about the same multiply density as processors with integrated custom multipliers. When large custom multiplier arrays are used on small data, the gap between the custom devices and the reconfigurable devices narrows. Similarly, when a multiply operand is constant or slowly changing, reconfigurable devices may exploit the reduction in operation complexity to narrow the density gap.