Directions in General-Purpose Computing Architectures

by André DeHon

General-purpose computing devices and systems are commodity building blocks which can be adapted to solve any number of computational tasks. It is this post-fabrication adaptability, perhaps more than any other feature, which has enabled and fueled the computer revolution over the past several decades. We adapt these general-purpose devices by feeding them a series of control bits -- i.e. "programming" or "configuring" the device according to our computational needs. We have traditionally called these bits instructions, as they instruct the programmable silicon on how to function -- i.e. they instruct logical units as to which operations to perform, they instruct interconnect as to which way to route data, and they instruct memories on when to read and write values.

While all general-purpose computing devices have instructions, distinct architectures treat them differently -- and it is precisely the management of device instructions which differentiates various general-purpose computer architectures. When architecting a general-purpose device, we must make decisions on issues such as:

- How wide should each datapath processing element be -- that is, how many bits should a single instruction control?
- How many instructions should be stored on chip for each active computing element?
- How much area and bandwidth should be dedicated to distributing instructions, and how frequently may the instruction controlling each element change?

The answers to these questions define much of a device's architecture and determine where the general-purpose component is most efficient.

Conventional programmable processors, such as microprocessors, have

- wide, word-level datapaths,
- a large number of instructions stored on chip per active computing element (deep instruction caches), and
- instruction distribution resources which allow the datapath to receive a new instruction on every cycle.

As a consequence, these devices are efficient on wide-word data and irregular tasks -- i.e. tasks which need to perform a large number of distinct operations on each datapath processing element. On tasks with small data items, the active computing resources are underutilized, wasting computing potential. On very regular computational tasks, the on-chip space to hold a large sequence of instructions goes largely unused.
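
We can put the small-data penalty in rough numerical terms. The sketch below (in Python) is purely illustrative: the 64-bit datapath width and the sample operand sizes are assumptions, not measurements of any particular processor.

    # Illustrative only: fraction of a wide datapath's bit-slices doing useful
    # work when the operands are narrower than the datapath (widths assumed).
    def datapath_utilization(operand_bits, datapath_bits=64):
        return min(operand_bits, datapath_bits) / datapath_bits

    for bits in (1, 4, 8, 16, 64):
        print(f"{bits:2d}-bit operands: {datapath_utilization(bits):.1%} of the datapath active")

By this crude measure, a bit-level task leaves more than 98 percent of a 64-bit datapath idle.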

In contrast, conventional configurable devices, such as FPGAs, have

- single-bit datapath granularity,
- a single instruction resident per active array element, and
- instructions which are held in place, changing only when the device is reconfigured.

As a consequence, these devices are efficient on bit-level data and regular tasks -- i.e. tasks which need to repeatedly perform the same collection of operations on data from cycle to cycle. On tasks with large data elements, these fine-grained devices pay excessive area for interconnect and instruction storage versus a coarser-grained device. On very irregular computational tasks, the active computing elements are underutilized: either the array holds all subcomputations required by a task, but only a small subset of the array elements are used at any point in time, or the array holds only the subcomputation needed at each point in time, but must sit idle for long periods between computational subtasks while the next subtask's array instructions are reloaded.
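
The two failure modes on irregular tasks can be sketched in equally rough terms. All of the figures below -- the subtask count, run length, and reload time -- are assumptions chosen only to illustrate the trade-off.

    # Illustrative only: an irregular task made of n distinct subcomputations,
    # roughly equal in size and needed one after another (all figures assumed).
    def spatial_utilization(n_subtasks):
        # The array holds all n subcomputations, but only one runs at a time.
        return 1.0 / n_subtasks

    def temporal_utilization(run_cycles, reload_cycles):
        # The array holds one subcomputation and idles while the next is loaded.
        return run_cycles / (run_cycles + reload_cycles)

    print(f"hold everything:  {spatial_utilization(16):.1%} of array cells active at a time")
    print(f"reload as needed: {temporal_utilization(1_000, 10_000):.1%} of cycles spent computing")

Either way, most of the array's raw capacity goes unused on such a task.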

Unfortunately, most real computations are neither purely regular nor purely irregular, and real computations do not work on data elements of a single size. Typical computer programs spend most of their time in a very small portion of the code. In the kernel where most of the computational time is spent, the same computation is heavily repeated, making it very regular. The rest of the code is used infrequently, making it irregular. Further, in systems, a general-purpose computational device is typically called upon to run many applications with differing requirements for datapath size, regularity, and control streams. This broad range of requirements makes it difficult, if not impossible, to achieve robust and efficient performance across entire applications or application sets by selecting a single computational device at either extreme of today's conventional architectures.
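
The consequence of this mix can be made concrete with an Amdahl-style calculation. The 90/10 split and the speedup figures below are assumptions for the sake of illustration, not measurements.

    # Illustrative only: overall speedup when a device accelerates the regular
    # kernel but handles the irregular remainder poorly (all figures assumed).
    def overall_speedup(kernel_fraction, kernel_speedup, other_speedup):
        return 1.0 / (kernel_fraction / kernel_speedup
                      + (1.0 - kernel_fraction) / other_speedup)

    print(f"{overall_speedup(0.90, 20.0, 1.0):.1f}x")  # kernel 20x faster, rest unchanged
    print(f"{overall_speedup(0.90, 20.0, 0.5):.1f}x")  # kernel 20x faster, rest at half speed

Even a large speedup on the regular kernel is quickly eroded when the same device handles the irregular remainder poorly.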

Potential solutions to this dilemma reside in hybrid architectures which tightly couple elements of both extremes and in hybrid architectures which draw from the broad architectural space left open between the extremes represented by conventional processors and conventional FPGAs.

Multiple-context FPGAs, such as MIT's DPGA (EET, Jan. 29, page 41 [1]), provide one such intermediate point in this architectural space. The DPGA retains the bit-level granularity of FPGAs, but instead of holding a single instruction per active array element, the DPGA stores several instructions per array element. The memory necessary to hold each instruction is small compared to the area of the array element and interconnect which the instruction controls. Consequently, adding a small number of on-chip instructions does not substantially increase die size. The addition does, however, substantially increase the device's ability to handle more irregular computational tasks efficiently. At the same time, a large number of on-chip instructions is not as clearly beneficial: while the instructions are small, their size is not trivial, and supporting a large number of instructions per array element (e.g. tens to hundreds) would cause a substantial increase in die area, decreasing the device's efficiency on regular tasks.
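
A back-of-the-envelope area model makes this asymmetry visible. The 5 percent instruction-memory share used below is an assumption for illustration, not a DPGA measurement.

    # Illustrative only: relative area of an array element with k on-chip
    # instruction contexts, assuming one context's memory costs about 5% of
    # the element's logic-plus-interconnect area (the 5% figure is assumed).
    def relative_area(contexts, instr_fraction=0.05):
        return 1.0 + (contexts - 1) * instr_fraction

    for k in (1, 4, 8, 64, 256):
        print(f"{k:3d} contexts: {relative_area(k):5.2f}x the single-context element area")

Under such an assumption, a handful of contexts costs little, while hundreds of contexts would several times over multiply -- and eventually dominate -- the element area.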

Multiple-context components with moderate-width datapaths also fall within this intermediate architectural space. Pilkington's VDSP (EET, Aug. 7, page 16 [2]) has an 8-bit datapath and space for 4 instructions per datapath element. UC Berkeley's PADDI and PADDI-II ([3]) have a 16-bit datapath and 8 instructions per datapath element. Both of these architectures were originally developed for signal-processing applications and can handle semi-regular tasks on small datapaths very efficiently. Here, too, the instructions are small compared to the active datapath computing elements, so including 4-8 instructions per datapath substantially increases device efficiency on irregular applications with minimal impact on die area.

While intermediate architectures such as these are often superior to the conventional extremes of processors and FPGAs, any architecture with a fixed datapath width, on-chip instruction depth, and instruction distribution area will always be less efficient than the architecture whose datapath width, local instruction depth, and instruction distribution bandwidth exactly match the needs of a particular application. Unfortunately, since the space of allocations is large and the requirements change from application to application, it will never make sense to produce every such architecture -- and even if we did, a single system would have to choose one of them. Flexible, post-fabrication assembly of datapaths and assignment of routing channels and memories to instruction distribution enables a single component to deploy its resources efficiently, allowing the device to realize the architecture best suited for each application. This is the approach taken by MIT's MATRIX component (EET, April 22, page 33 [4]).
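
One way to make "less efficient than the matched architecture" concrete is a simple area-efficiency ratio. Everything in the sketch below -- the cost coefficients, the ganging rule, and the example design points -- is an assumption chosen for illustration, not a model of any particular device.

    import math

    # Illustrative only: area of one processing element with datapath width w
    # and instruction depth c (all cost coefficients are assumptions).
    def element_area(w, c, overhead=2.0, per_bit=1.0, per_instr_bit=0.2):
        return overhead + per_bit * w + per_instr_bit * w * c

    def efficiency(task_w, task_c, arch_w, arch_c):
        # Assumes the fixed architecture gangs elements to reach the task's
        # width and already provides at least the task's instruction depth.
        n = math.ceil(task_w / arch_w)
        return element_area(task_w, task_c) / (n * element_area(arch_w, arch_c))

    print(f"{efficiency(1, 1, 64, 4):.1%}")   # 1-bit task on a 64-bit, 4-context element
    print(f"{efficiency(64, 1, 1, 1):.1%}")   # 64-bit task built from 1-bit, 1-context cells

Whatever the exact coefficients, a fixed design point pays heavily whenever the application's width or instruction-depth needs differ from what the silicon provides -- which is the motivation for letting the component assemble its datapaths and instruction distribution after fabrication.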

Since many applications contain a mix of regular and irregular subtasks, a hybrid architecture which tightly couples arrays of mixed datapath sizes and instruction depths, along with flexible control, can often provide the most robust performance across the entire application. In the simplest case, such an architecture might couple an FPGA (or DPGA) array with a conventional processor, allocating the regular, fine-grained tasks to the array and the irregular, coarse-grained tasks to the conventional processor. Such coupled architectures are now being studied by several groups (see [5]).

In summary, we see that conventional general-purpose device architectures, both microprocessors and FPGAs, live at extreme ends of a rich architectural space. As feature sizes shrink and the die real estate available for computing grows, microprocessors have traditionally gone to wider datapaths and deeper instruction and data caches, while FPGAs have maintained single-bit granularity and a single instruction per array element. This trend has widened the space between the two architectural extremes and accentuated the realm where each is efficient. A more effective use of the silicon area now becoming available for general-purpose computing components lies in the space between these extremes. In this space, we see the emergence of intermediate architectures, architectures with flexible resource allocation, and architectures which mix components from multiple points in the space. Both processors and FPGAs stand to learn from each other's strengths. In processor design, we will learn that not all instructions need to change on every cycle, allowing us to increase the computational work done per cycle without correspondingly increasing on-chip instruction memory area or instruction distribution bandwidth. In reconfigurable device design, we will learn that a single instruction per datapath is limiting and that a few additional instructions are inexpensive, allowing the devices to cope efficiently with a wider range of computational tasks.


[1] -- <http://www.ai.mit.edu/projects/transit/dpga_prototype_documents.html>
[2] -- <http://www.pmel.com/dsp.html>
[3] -- <http://infopad.eecs.berkeley.edu/research/tools/Paddi/>
[4] -- <http://www.ai.mit.edu/projects/transit/matrix_documents.html>
[5] -- <http://www.cs.berkeley.edu/projects/brass/reproc.html>


André DeHon <andre@mit.edu>, Reinventing Computing, MIT AI Lab