Previous: Structure and Composition of Reconfigurable Computing Devices Up: Structure and Composition of Reconfigurable Computing Devices Next: Instructions
Programmable interconnect is the dominant contributor to die area and cycle time in configurable devices. To support their large, active functional density, the computational units must be richly interconnected and support highly parallel data routing. FPGAs, more than other general-purpose devices, place most of their area into interconnect.
We review interconnect issues in the context of on-chip networks for reconfigurable architectures. We establish typical size and delay contributions by analyzing conventional FPGA implementations, then we look at how resource requirements grow with increasing array size. Understanding conventional sizes and growth factors help us characterize the design space. It also serves as background context for the architectural developments described in Part .
In this chapter, we:
This breakdown, alone, shows us one reason why a full 4-input lookup table is often used as the programmable logic element, rather than a more restricted gate. The area required for the full LUT, including its configuration memory, is less than 10% of the area of the 4-LUT cell, such that there is little advantage to reducing the cell's functional size.
The number of programming bits per 4-LUT for these devices is summarized in Table . Using a rather large memory cell (4.5K/bit), the memory accounted for 35% of the area on UTFPGA. With 4-contexts and 600 3T-DRAM memory cells, memory only occupied 33% of the area on the DPGA. If we assume 1000 static memory cells, for the Xilinx parts, memory accounts for about 15-30% of that area ( (32%), , (15%), , (32%), , (21%)). Making similar assumptions, memory accounts for 21% of an Altera 8K part () and 11% () of an Orca 2C part. Interconnect and routing occupies the balance of the area (70-90%).
Most vendors lump interconnect timing in with lookup table evaluation, making it difficult to distinguish the components of delay. Table summaries interconnect and LUT logic delay for Altera's 8K series [Alt95] and our own experience with the DPGA (Chapter ). From here, we see that interconnect typically accounts for 80% of the path delay.
FPGA networks, which already need to interconnect thousands of independent processing elements, do not, typically, look like conventional multiprocessor networks. In particular, a number of conceptually ``simple'' network structures commonly used as the basis for multiprocessor networks do not scale properly for use in FPGAs. In this section, we review three typical organizations and highlight their shortcomings on the scale required for FPGA networks.
To guarantee arbitrary, full, connectivity among elements, we could build a a full crossbar for the interconnection network. In such a scheme we would not have to worry about whether or not a given network could be mapped onto the programmable interconnect nor would we have to worry about where logic elements were placed. Unfortunately, the cost for this full interconnect is prohibitively high.
For an element array where each element is a -input function ( e.g. -LUT), the crossbar would be an crossbar. Arranged in a roughly square array, each input and output must travel distance, before we account for saturated wire density. Since interconnect delay is proportional to interconnect distance, this implies the interconnect delay grows at least as . However, the bisection bandwidth for any full crossbar is . For sufficiently large , this bisection bandwidth requires that the side of an array be to accommodate the wires across the bisection. In turn, the bisection bandwidth dictates an area . This also dictates input and output wires of length . For large crossbars, wire size dominates the areas. These growth rates are not acceptable even at the level of thousands of LUTs. If we were to build devices using a single monolithic crossbar for interconnect:
Consider, for the sake of illustration, the size of a crossbar required to interconnect a 2,500 4-LUT device. We will assume the minimum wire pitch is 8 and the crossbar is implemented with two layers of dense metal routed at this minimum wire-pitch. The area of such an array, as dictated simply by the wiring would be: Making for an area of per 4-LUT just to handle the requisite wiring. Conventional FPGAs use a single SRAM cell to configure each of the crosspoints in the crossbar. If this were done, the area would be memory bit dominated rather than wire dominated and take up: Which results in 10M per 4-LUT just to hold the configuration memory. The area per LUT, of course, continues to grow linearly in the number of LUTs for larger networks.
Multistage interconnection networks ( e.g. butterfly, omega, CLOS, Benes) can reduce the total number of switches required from to , but have the same bisection bandwidth problem. Between any two pair of stages in a butterfly network, the total bisection bandwidth is , such that the wiring requirements dictate that area grows at .
At the opposite interconnect extreme, we can use only local connections within the array between adjacent, or close, array elements. By limiting all the connections to fixed distances, the link delay does not grow as the array grows. Further, the bisection bandwidth in a mesh configuration is and hence, never dominates the logical array element size. However, communicating a piece of data between two points in the array requires switching delay proportional to the distance between the source and the destination. Since switching delay through programmable interconnect is generally much larger than fanout or wire propagation delay along a fixed wire, this makes distant communication slow and expensive. For example, in a topology where direct connections are only made between an array element and its north, east, south, and west neighbors (typically called a NEWS network), a signal must traverse a number of programmable switching elements proportional to the Manhattan distance between the source and the destination (). For the interconnect network topologies typically encountered in logic circuits, this can make interconnect delay quite high -- easily dominating the delay through the array element logic.
With this background, we can begin to formulate the design requirements for programmable interconnect:
Conventional FPGA interconnect takes a hybrid approach with a mix of short, neighbor connections and longer connections. Figure shows a canonical FPGA LUT tile. Full connectivity is not supported even within the interconnect of a single tile. Typically, the interconnect block includes:
The University of Toronto has performed a number of empirical interconnect studies aimed at establishing basic FPGA interconnect characteristics, including:
One of the key differences between FPGAs and traditional ``multiprocessor'' networks is that FPGA interconnect paths are locked down serving a single function. The FPGA must be able to simultaneously route all source-sink connections using unique resources to realize the connectivity required by the FPGA. Another key difference is that the interconnection pattern is known a prior to execution, so offline partitioning and placement can be used to exploit locality and thereby reduce the interconnect requirements.
Before we examine how network requirements scale with connectivity and network size, in this section, we briefly review the number of switches conventionally employed by networks supporting 100 to 1000 4-LUTs. Brown and Rose [BFRV92][RB91] suggest each 4-LUT in a moderate sized FPGA with 100's of 4-LUTs will require 200-400 switches. Agarwal and Lewis suggest approximately 100 switches per LUT for hierarchical FPGAs [AL94] with some reduction in logic utilization. Conventional, commercial FPGAs do little or no encoding on their interconnect bit streams -- that is, each interconnect switch is controlled by a single configuration bit. From the configuration bit summary in Table , we see that commercial devices also exhibit on the order of 200 switches per 4-LUT. The fact that conventional FPGAs can, with difficulty, route most all designs using less than 80-90% of the device LUTs, suggests that they chose a number of switches which provides reasonably ``adequate'' interconnect for the current device sizes -- hundreds to a couple of thousand 4-LUTs.
In Sections and , we have empirically established the size of conventional interconnect. However, as we glimpsed in Section , the area which these resources occupy is not necessarily independent of the number of LUTs interconnected. In this section we look at how interconnect requirements will grow with the number of LUTs supported.
The best characterization to date which empirically meters interconnect requirements is Rent's Rule [Vil82][LR71]:
is the number number of interconnection in/out of a region containing . and are empirical constants. For logic functions , typically.
El Gamal used a stochastic model to estimate the interconnection requirements for channeled gate arrays [Gam81]. He found that each routing channel requires tracks if the average wire length, , grows faster than . here is the total number of circuits in the array, generally arranged in an array. Brown used El Gamal's routing model for FPGAs and found good correspondence between it and FPGA interconnect requirements [Bro92]. For large numbers of gates () and , Donath finds that [Don79]. Together this means the channel width grows as . From which we can derive the interconnect requirements growth:
is often considered a good, conservative, value for to handle most interconnect requirements.
For , Donath finds that grows as or smaller. For , El Gamal's model suggests the the track width grows as . In this case, total interconnect requirements grow as .
The gates are recursively partitioned into equally sized sets at each level of the hierarchy. The principal interconnect occurs at each node of convergence in the hierarchy (See Figure ). At a level in the hierarchy, each node has a fan-in from below of signals and a fan-in from above of . Similarly, it has a fan-out of toward the leaves and towards the root. At each level , we have LUTs, external inputs, and external outputs. According to the hierarchical combining and Rent's Rule growth, we have:
We take , the number of LUT inputs. When , which will be true for small , we take -- that is, all outputs are passed out of the region when this Rent bandwidth permits.
Logically, we have distinct output directions from each node of convergence in the interconnect -- for the leaves, plus one for the root. Allowing full connectivity within each tree node, each of the leaves picks its inputs from the outputs from its siblings and from the inputs from the parent node. The outputs of this node are selected from the outputs from all subtrees converging at this point. Figure shows this basic arrangement for .
First, let us consider how wiring resources grow in this structure. At each stage of the hierarchy, there are wires coming and leaving each subarray. This makes the bisection width of . For a two-dimensional network layout, this bisection width must cross out of the subarray through the perimeter. Thus the perimeter of each subarray is . The area of the subarrays will be proportional to the square of its perimeter, making: The area required for each LUT based on wiring constraints, then, goes as:
Not unsurprisingly, this matches the interconnect growth we derived in Equation . Of course, if , wiring is not the dominant resource constraining LUT area. may be for as far as strict wiring requirements are concerned.
We can also look at the number of switches required if each of the logical switching units is a fully-populated crossbar. At each level, , the total number of switches is:
Amortizing across the number of LUTs supported at level , we can count the number of switches per LUT at each level:
Summing across all levels, we can thus calculate the number of switches per LUT as a function of the size of the network. Substituting for , and expanding sum:
For , this gives us:
For , each sum term in Equation goes to one:
For ,
Putting these cases, together:
Here, we see switching area per LUT grows as , for , and for . Again, this matches our wiring growth expectations (Equation ).
While Equation gives the correct growth rates it overestimates the required number of switches on two accounts:
In the previous section, we saw that the amount of interconnect we need to provide depends upon the connectivity of the network. This makes it difficult to design a single network which will efficiently accommodate arbitrary designs. If the design has limited connectivity, but the network provides a large amount of connectivity, the network is over designed relative to the design and provides less functional density than achievable. If the design has considerable connectivity, but the network provides less, the design must be routed sparsely on the interconnect, leaving many of the device LUTs unusable.
Using the switching models derived in the previous section, we can examine the relative inefficiencies of using a design with Rent exponent on a network with Rent exponent . We do this by looking at the ratio of the area occupied by a design with LUTs and on top of a network built using Rent exponent . If , then the ratio is simply the ratio of the area per LUT of a interconnect of LUTs to the area per LUT of a interconnect of LUTs. However, if , we cannot simply map the design netlist on top of the device LUTs. Here, we have to figure out how much larger the network must be than the number of LUTs in the design in order to accommodate the highly connected design. Let us call this scaling factor . In order for the network to accommodate the design, it must have enough i/o bandwidth into each subregion. Starting at the top level in the design, this means: The only way to accommodate this requirement with a fixed is to scale up the network used. Applying Rent's Rule (Equation ), this means: Solving this relation for equality:
Note that once we accommodate the top level of the design, all other levels are also accommodated as well. That is, once we have chosen as above, at the top level:
Since , at level , the connectivity required for the design will shrink faster than the network connectivity, so lower levels are satisfied by the same scale up factor which satisfies the top level in the design. The overhead ratio for the case, then, is the ratio of the size of a interconnect with Rent exponent compared to the size of an interconnect with Rent exponent .
In making this area comparison, we assume that switching area dominates non-switching area, and we approximate LUT area as proportional to the number of switches. From Section , we saw that this is true of conventional devices. In the previous section, we saw that switching requirements grow at least as fast as wires, and generally faster than non-switching resources. This suggests that switching area will continue to dominate non-switching area as device capacities grow.
If we solve for strictly according to Equation , the ratios are continuous and do not take into account the discretization affects associated with network size and levels. The continuous approximation gives us a smooth way to compare general overhead growth trends. Figure shows both the discrete and continuous comparisons for various implemented on a network with as a function of . Figure similarly shows the relative overheads for implementing designs with on designs. Figure plots the same data as the continuous case from Figure on three axes. Figure plots the continuous efficiency, the inverse of overhead, and Figure plots the continuous efficiency on a logarithmic scale.
Ideally, we would like to match the programmable network connectivity to the design connectivity. Unfortunately, we do not generally get that choice. Figures through show us that it is just as inefficient to provide too much interconnect for a design as it is to provide too little. This is important to notice, since there is a tendency to demand rich interconnect that provides high gate utilization across all designs. However, since the non-interconnect area is trivial compared to network area in FPGA devices, optimizing for gate utilization is often short sighted.
As a final, illustrative example, let us consider the task of picking the network connectivity, , assuming that we know typical designs will have a between 0.4 and 0.8. Figure shows the overheads for values of 0.4, 0.58, and 0.8 as a function of , respectively. If we further assume that the design Rent exponents are evenly distributed in this range, we can calculate an expected overhead: Figure plots this expected overhead for the identified range. We see that the expected overhead is quite flat between =0.5 and 0.6 with an expected overhead of just over 2. At the ends of the spectrum, the expected overhead is 8 worse. Note, in particular, if we chose to build in order to guarantee full utilization of every LUT, we would pay a 16 overhead on average, and a 56 overhead in the worst case. In contrast, choosing has a worst-case overhead of 4.2 and an average overhead of 2.2.
We can also ask how the requirements for interconnect description will grow. Trivially, we know that it will grow no faster than the number of switches composing the interconnect. However, it can actually grow much slower. We start (Section ) by using the full-connectivity model of the crossbar to establish an upper bound on the necessary interconnect description length. We then continue (Section ) using the Rent's rule based hierarchical interconnect, as in previous sections, to derive a tighter approximation. By either metric, we see that the instruction sizes for conventional FPGAs are significantly larger than necessary. This observation suggests that context memory area and reload instruction bandwidth can be significantly reduced by judicious coding (Section ).
Assuming that the network may be arbitrarily connected, we can count the number of possible interconnection patterns to get an upper bound on the number of interconnection bits which can be usefully employed describing the input to each LUT. We start by assuming we have a device composed of:
Since a LUT has inputs, the total number of interconnect bits needed is simply:
e.g. A 1000 4-LUT device with 200 inputs would require only 41 bits (44 if we encode each input separately) to specify each LUT's interconnect. A 9000 4-LUT device with 600 inputs requires only 53 bits (56 for separate input encodings).
If our functional elements are truly 4-LUTs, then this upper bound can be tightend by noticing that we gain no additional functionality by being able to route a particular source into the LUT multiple times and the assignment of the sources to LUT inputs is inconsequential. With this observation, we really only need to choose items from when specifying the LUT interconnect. This gives:
e.g. our 1000 4-LUT device with 200 inputs requires only 37 bits and our 9000 4-LUT device with 600 inputs requires only 49 bits. Here we save an additional 4 bits per 4-LUT. Asymptotically:
So, we expect that exploiting the equivalence of the inputs on a -LUT to save us bits from the number of bits required for full interconnect. For , this amounts to a savings of 4-5 bits per LUT.
Commercial devices are not purely composed up of LUTs, but we can draw a box around their basic programming elements and use the above counting arguments to get a loose upper bound on the number of interconnect programming bits they could require. Table shows parameters for each of several commercial device families along with a pedagogical reference. Using the parameters given in Table , we can use the full connectivity assumption to compute an upper bound on the network description length:
Table calculates for each of the device families from Table and contrasts these numbers with the number of actual device bits per basic element. The comparison is necessarily crude since vendors do not provide detailed information on their configuration streams. However, we expect the unaccounted control bits in Table to not be more than 10% of the total bits per block. With this expectation, we see that the commercial devices exhibit a factor of two to three more interconnect configuration bits than would be required to provide full, placement-independent, interconnect of the logic blocks.
The upper bound derived in the previous section assumed full connectivity of the network. However, the network is generally much more restricted. The restrictions imply a smaller class of realizable connection patterns and fewer requisite interconnect bits. In this section we return to our pedagogical, hierarchical interconnect from Section . For small Rent exponents, , we can derive tighter bounds.
Reconsidering, the hierarchical interconnect structure from Figures and , we can calculate the number of bits required per level of the hierarchy.
Table summarizes these values by level for , along with the number of bits according to the earlier crossbar and K-LUT equivalence calculations. This scheme also gives only an upper bound since the individual treatment of the permutations within each level counts more distinct combinations than actually exist. For moderate values of , though, this will give a tighter bound than the crossbar bound derived in the previous section.
The number of bits required will vary with the Rent exponent . Figure shows this variation. It also shows the relationship among choices of the arity of the hierarchy, , and the crossbar and K-LUT equivalence bounds. Figure shows the growth rate versus the number of LUTs for several Rent exponents and the two crossbar bounds.
Note that Donath performs a similar calculation in [Don74]. He uses a more restrictive interconnect model. Using 2- to 3-input, single-function gates, he calculates 7-10 bits of memory per to 0.8. In Donath's model, the required description bits does not grow with network size.
Resources for instruction storage and distribution, as the next Chapter (Chapter ) will address, can take up significant area and play a big role in the characteristics of an architecture. Notably, the size of the instruction determines the size of the instruction store on and off chip and the bandwidth required to load new instructions.
The bounds we derived in the previous sections show that the instruction sizes in traditional FPGAs are higher than necessary, at least by a factor of 2-4. For single context devices as we have seen, instruction memory makes up only a small fraction of the area on a conventional FPGA. For this reason, these bloated instructions do not adversely affect FPGA cell area (See Figure ). In fact, in wire limited regimes, they may help by localizing instruction bits to the values they control.
The most significant impact is on reconfiguration time. Smaller instructions mean we can reload instructions in less time, given the same bandwidth for instruction reload. Alternately, it means that correspondingly less resources can be dedicated to instruction distribution in order to achieve the same instruction reload time as the larger instructions.
As we begin to make heavy use of the reconfigurable aspects of programmable devices, device reconfiguration time becomes an important factor determining the performance provided by the part. In these rapid reuse scenarios, instruction size can play a significant role in determining device area and performance.
From the previous sections, we have seen that interconnect requirements grow faster than interconnect description requirements. Specifically:
This is one reason that single context devices can afford to use sparse interconnect encodings. Since the wires and switches are the dominant and limiting resource, additional configuration bits are not costly. In the wire limited case, we may have free area under long routing channels for memory cells. In fact, dense encoding of the configuration space has the negative effect that control signals must be routed from the configuration memory cells to the switching points. The closer we try to squeeze the bit stream encoding to its minimum, the less locality we have available between configuration bits and controlled switches. These control lines compete with network wiring, exacerbating the routing problems on a wire dominated layout.
So far, we have looked entirely at single-bit level granularity networks and designs. In this section, we look at how the multi-bit designs and networks effect the relations we have already developed.
In general, we will assume a -bit datapath with total bit processing elements. Groups of bit processing elements will act as a single compute node. We thus have such compute nodes.
We look at the wiring requirements, as we did before, by looking at the bisection bandwidth implied by the network. Assuming Rent's rule based hierarchical interconnect, at the top level we have i/o busses of width . This makes for a total bisection bandwidth: This makes the wire dictated area growth go as: Per LUT this makes:
Notice that this is actually larger than the interconnect wiring area required for single-bit interconnects (Equation ).
This result makes the assumption that the bits composing a node are tightly interconnected or otherwise coupled such that the minimum bisection occurs between tree levels as before. If the bits in a node were not interconnected in any way, the network could be decomposed into single bit networks. In such a case, the size would simply be times the size of an single bit network, making:
Equation implies the area is actually smaller than the single bit network for . In practice the earlier result (Equation ) is most realistic.
One issue this raises is that a -bit design with Rent exponent implemented on top of a single bit network will require more interconnect per level than a single bit design with Rent exponent . Using the same technique as in Section , we can solve for the required scale up factor:
Switching requirements, in contrast, diminish with increasing since less flexibility is required of the network with a given number of bit processing elements. We can derive the switching requirements by substituting in for in Equation then multiplying by the datapath width:
For large , wiring requirements will asymptotically dominate switching requirements in a bussed interconnect scheme.
This section focussed on interconnect. We established via an empirical review that interconnect makes up the dominant area and delay in conventional FPGAs. We then went on to look at network design issues. We established basic relations governing the interconnect requirements in terms of network size and wiring complexity. In the processes we showed that it is not always best to build the network with sufficient interconnect to accommodate the most heavily interconnected designs at full gate utilization; rather, since interconnect is the dominant area contributor, more efficient area utilization can be achieved with networks with lower interconnect complexity. We also looked at interconnect description requirements, noting that interconnect descriptions grow more slowly than wire and switching requirements. Here, we pointed out that conventional devices use sparse interconnect description encodings, using, at least a factor of 2-4 more configuration bits than necessary; this observation suggests that we have an opportunity to reduce the area required to hold descriptions in multicontext devices and the bandwidth required for configuration reload in single or multicontext devices. Finally, we looked at how the interconnect size relationships change with wider word operations and saw that greater word widths increase wiring requirements while decreasing switching requirements.