Transit Note #121

Notes on Programmable Interconnect

Andre DeHon

Original Issue: February, 1995

Last Updated: Sat Apr 8 21:42:45 EDT 1995


This is a collection of notes on programmable interconnect design for reconfigurable devices. These notes were collected in the context of DPGA design (tn95) (tn114) (tn112), but are mostly relevant to FPGA design in general.

We begin by describing the basic primitives used for building reconfigurable interconnect. We then describe two extremes in reconfigurable interconnect and review the interconnect used in the DPGA prototype as an example. From these examples, we formulate a general set of design goals for reconfigurable interconnect and, finally, build on the concepts detailed here to formulate a general interconnection strategy for programmable devices.


At the base of the network topology we consider two main interconnect primitives: multiplexors and crossbars.

Multiplexors -- Multiplexors select a signal from one of several input signals. Here, several lines in the programmable interconnect are brought into the data inputs of the mux and the context configuration drives the selector input to the mux as shown below. The configuration memory has multiple contexts as with the array element (gate) memory (See Figure ).

It may be instructive to compare this primitive to the LUT (lookup table) used for the logical gate primitive in many reconfigurable devices ( e.g. [Xil93] [BCE+94]). In the gate case, the inputs (from the programmable network) connect into the selector and the configuration (from the context memory) connects into the data inputs of the mux. In the interconnect case, these roles are swapped, with the inputs driving into the data inputs of the mux and the configuration memory driving the selector inputs.

Crossbars -- A crossbar is a basic interconnect primitive with a set of N inputs and a set of M outputs which can connect any of its outputs to any of its inputs (See Figure ). One could, for instance, build an N-input, M-output crossbar by taking M of the N-input mux primitives introduced above and wiring the corresponding data inputs of all the muxes together.

Other programmable interconnect primitives exist, such as pass-gate switches and tristate bus drivers, each controlled directly by its own configuration bit.

These primitives are, in fact, often the basis for building the earlier ones (multiplexors and crossbars). However, their direct use is less favored in multiple-context network architectures. The reason is one of efficient memory/configuration bit utilization. In particular, the muxes and crossbars have exactly as many legal configurations as their configuration bits can describe, so no configuration memory is wasted on illegal or redundant settings.

For example, consider an N-input, M-output crossbar. It contains N*M switches. If a memory cell were dedicated to each switch, it would require N*M memory cells per context. However, if we restrict the sensible configurations to those which connect each output to exactly one input, the legal space of configurations is only N^M. If the crossbar primitive is built out of the mux primitive as suggested above, we require exactly M*ceil(log2 N) memory cells per context. This savings in configuration memory is, of course, of increasing importance as we go to larger numbers of on-chip contexts.
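As a sanity check on this accounting, the three storage costs can be tallied for an arbitrary crossbar size. The function below is an illustrative sketch (not from the original note); the 32-input, 8-output size is chosen arbitrarily.

```python
import math

def crossbar_config_bits(n_inputs, n_outputs):
    """Per-context configuration storage for an N-input, M-output crossbar."""
    per_switch = n_inputs * n_outputs  # one memory cell per crosspoint switch
    # M muxes, each needing ceil(log2 N) select bits:
    mux_based = n_outputs * math.ceil(math.log2(n_inputs))
    # Information-theoretic minimum: log2 of the N^M legal configurations.
    minimum = math.ceil(n_outputs * math.log2(n_inputs))
    return per_switch, mux_based, minimum

print(crossbar_config_bits(32, 8))  # -> (256, 40, 40)
```

When N is a power of two, the mux-based construction meets the information-theoretic minimum exactly, while the per-switch scheme pays a large constant-factor overhead in every context.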

It is worth noting that, even with crossbars and multiplexors, one can build systems where multiple, ``distinct'' configurations are functionally equivalent. For these reasons, one needs to consider the entire system and what configurations are logically distinct. Consider, for example, using an N-input, 4-output crossbar to select the 4 inputs to a 4-LUT. Since all the inputs to the 4-LUT are logically equivalent (assuming we have freedom to fully program the LUT memory), the order of the input signals into the 4-LUT does not matter. That is, all orderings of a given subset of 4 inputs will be equivalent. It, therefore, does not make sense to use a crossbar in this case. In fact, the crossbar has N^4 configurations whereas only N!/(4!(N-4)!) combinations are distinct in this scenario. As a consequence, it is generally better to use multiplexors for the final input selection into a LUT than to use a crossbar.
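The gap between crossbar configurations and logically distinct input subsets is easy to quantify; the numbers below assume a hypothetical N = 16 lines feeding a 4-LUT.

```python
import math

n = 16  # lines available at the LUT inputs (illustrative value)
k = 4   # inputs to the 4-LUT

full_crossbar_configs = n ** k           # ordered selections an n x 4 crossbar encodes
distinct_combinations = math.comb(n, k)  # unordered subsets that are logically distinct

print(full_crossbar_configs, distinct_combinations)  # -> 65536 1820
```

Here the crossbar would spend configuration bits distinguishing 65536 settings when only 1820 of them matter.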

The key idea here is to match the information content stored in configuration memory to the information content required to configure all interesting and distinct configurations of the programmable device. In particular, if a device can be configured in C legal and useful ways, it should only require ceil(log2 C) bits of memory to specify its configuration.

Understanding Network Interconnect -- Two Extremes

Now, let's look at the first two schemes for network interconnect which may come to mind:

  1. Full Crossbar
  2. Nearest Neighbor
While both of these have their places, both scale poorly and, alone, are poor choices for the size of arrays of general interest.

Full Crossbar -- In the most extreme case of full connectivity, we could build a full crossbar to connect every array element input to every array element output. In such a scheme we would not have to worry about whether or not a given network could be mapped onto the programmable interconnect -- however, the cost for this full interconnect is prohibitively high. Further, this scheme is inefficient due to the logically equivalent configuration problem introduced above.

For an N-element array where each element is a K-input function ( e.g. a K-LUT), the crossbar would be an N x NK crossbar. This requires NK*ceil(log2 N) memory bits for configuration. Arranged in a roughly square array, each input and output must travel O(sqrt(N)) distance, before we account for saturated wire density. Since interconnect delay is proportional to interconnect distance, this implies the interconnect delay grows at least as O(sqrt(N)). However, the bisection bandwidth for this crossbar is O(N). For sufficiently large N, this bisection bandwidth requires that the side of the array be O(N) to accommodate the wires across the bisection. In turn, the bisection bandwidth dictates an O(N^2) area and input and output runs of length O(N). For large crossbars, then, wire size dominates the active logic and the configuration space.

Full crossbars make some pragmatic sense in the region where the interconnect requirements (which grow as O(N^2)) and configuration requirements (which grow as O(N log N)) do not dominate the logical array size (which grows as O(N)). In this region where size is dominated by the array element logic, interconnect delays grow as O(sqrt(N)), so size is still balanced with the mandatory interconnect delay.
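To see where the crossover lies, one can tabulate the competing growth rates. The unit costs below are purely illustrative placeholders, not derived from any real process.

```python
import math

def full_crossbar_costs(n, k=4):
    """Growth rates for a full crossbar over n k-input array elements."""
    config_bits = n * k * math.ceil(math.log2(n))  # N*K mux outputs, log2(N) bits each
    bisection = n          # O(N): every element output may cross the cut
    logic_area = n         # O(N): one unit of logic per array element
    wire_area = n * n      # O(N^2): side O(N) once wires saturate
    return config_bits, bisection, logic_area, wire_area

for n in (16, 64, 256, 1024):
    print(n, full_crossbar_costs(n))
```

For small n the logic term dominates; by n = 1024 the wire-area term is three orders of magnitude larger than the logic.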

Nearest Neighbor or mesh -- At the opposite extreme, we can use only local connections within the mesh between adjacent (or close) array elements. By limiting all the connections to fixed distances, the link delay does not grow as the array grows. Further, the bisection bandwidth in a mesh configuration is O(sqrt(N)) and hence never dominates the logical array element size. However, communicating a piece of data between two points in the array requires switching delay proportional to the distance between the source and the destination. Since switching delay through programmable interconnect is generally much larger than fanout or wire propagation delay along a fixed wire, this makes distant communication slow and expensive. For example, in a topology where direct connections are only made between an array element and its north, east, south, and west neighbors (typically called a NEWS network -- See Figure ), a signal must traverse a number of programmable switching elements proportional to the Manhattan distance between the source and the destination (|dx| + |dy|). For the interconnect network topologies typically encountered in logic circuits, this can make interconnect delay quite high -- easily dominating the delay through the array element logic.
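The distance-proportional switching cost of the NEWS network can be captured in a one-line model; the coordinates and unit switch delay here are hypothetical.

```python
def mesh_delay(src, dst, t_switch=1.0):
    """Delay through a NEWS mesh: one programmable switch per Manhattan step."""
    (x0, y0), (x1, y1) = src, dst
    hops = abs(x1 - x0) + abs(y1 - y0)  # Manhattan distance |dx| + |dy|
    return hops * t_switch

print(mesh_delay((0, 0), (7, 5)))  # -> 12.0
```

Even a modest displacement costs a dozen switch delays, typically several times the delay through the array element logic itself.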

Example: Two-level Interconnect on DPGA Prototype

The DPGA prototype (tn114) (tn112) uses a two-tiered interconnect scheme. The array elements are organized into small arrays (subarrays) which are, in turn, tiled to build the entire array. The subarray acts as the leaf of the interconnect hierarchy. Each array element output is run vertically and horizontally across the entire span of the subarray. Each array element can, in turn, select as an input the output of any array element in its subarray which shares the same row or column (See Figure ). This topology allows a reasonably high degree of local connectivity without incurring the delay costs of switching through additional interconnect.

In addition to the local outputs which run across each row and column, a number of non-local lines are also allocated to each row and column. The non-local lines are driven by the global interconnect. Each LUT can then pick inputs from among the lines which cross its array element (See Figures and ). In the prototype, each row and column supports four non-local lines. Each array element can thus pick its inputs from eight global lines, six row and column neighbor outputs, and its own output. Each input is configured with an 8:1 multiplexor tapping a subset of the 15 lines which cross each cell, for the reasons suggested above. Of course, not all combinations of 15 inputs taken 4 at a time are available with this scheme. The inputs are arranged so any combination of local signals can be selected along with many subsets of global signals. Freedom available at the crossbar in assigning global lines to tracks reduces the impact of this restriction.

Between each pair of adjacent subarrays, a pair of crossbars routes the subarray outputs from one subarray into the non-local inputs of the adjacent subarray (See Figure ). Note that all array element outputs are available on all four sides of the subarray. In our prototype, this means that each crossbar is a 16x8 crossbar which routes 8 of the 16 outputs to the neighboring subarray's 8 inputs on that side. Each 16x8 crossbar is backed by a 4x32 memory to provide the 4 context configurations. Each crossbar output is configured by decoding 4 configuration bits to select among the 16 crossbar input signals.
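The memory sizing quoted above can be checked directly; the arithmetic below simply restates the prototype's numbers.

```python
contexts = 4
outputs = 8          # non-local lines driven into the adjacent subarray
bits_per_output = 4  # 4 decoded bits select among the 16 crossbar inputs

bits_per_context = outputs * bits_per_output  # the "32" in the 4x32 memory
total_bits = contexts * bits_per_context      # bits stored per crossbar
print(bits_per_context, total_bits)  # -> 32 128
```

A per-switch encoding would instead need 16 x 8 x 4 = 512 bits, four times as much memory.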

While the nearest neighbor interconnect is minimally sufficient for the 3x3 array of subarrays in the prototype, a larger array should include a richer interconnection scheme among subarrays to avoid incurring a large switching delay when routing between distant subarrays.

See [BCE+94] for more details of the prototype interconnect.

Design Goals for Programmable Interconnect

After seeing these examples, we can begin to formulate the design requirements for our programmable interconnect:

  1. Provide adequate flexibility -- The network must be capable of implementing the interconnection topology required by the programmed logic network with acceptable interconnect delays.
  2. Use configuration memory efficiently -- Space required for configuration memory can account for a reasonable fraction of the array real-estate. Configuration bits should be used efficiently (as discussed above).
  3. Balance bisection bandwidth -- As discussed above, interconnection wiring takes space and can, in some topologies, dominate the array size. The wiring topology should be chosen to balance interconnect bandwidth with array size so that neither strictly dominates the other.
  4. Minimize delays -- The delay through the routing network can easily be the dominant delay in a programmable technology. Care is required to minimize interconnect delays. Two significant factors of delay are:
    1. propagation and fanout delay -- Interconnect delay on a wire is proportional to distance and capacitive loading (fanout). This makes the delay of a signal run roughly proportional to its length, especially when there are regular taps into the signal run. Consequently, small/short signal runs are faster than long signal runs.
    2. switched element delay -- Each programmable switching element in a path (crossbar, multiplexor, etc.) adds delay. This delay is generally much larger than the propagation or fanout delay associated with covering the same physical distance. Consequently, one generally wants to minimize the number of switch elements in a path, even if this means using some longer signal runs.
    Switching can be used to reduce fanout on a line by segmenting tracks, and large fanout can be used to reduce switching by making a signal always available in several places. Minimizing the interconnect delay, therefore, always requires technology dependent tradeoffs between the amount of switching and the length of wire runs.
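The tradeoff between the two delay factors above can be made concrete with a simple linear model. The constants below are illustrative stand-ins for technology-dependent values, not measurements.

```python
def path_delay(n_switches, wire_length, t_switch=2.0, t_per_unit=0.25):
    """Total interconnect delay: switch-element delays plus
    distance-proportional propagation/fanout delay."""
    return n_switches * t_switch + wire_length * t_per_unit

# Route spanning 8 array elements:
hop_by_hop = path_delay(n_switches=8, wire_length=8)  # a switch at every element
long_line = path_delay(n_switches=2, wire_length=8)   # two switches, one long run
print(hop_by_hop, long_line)  # -> 18.0 6.0
```

With switching several times costlier than a unit of wire, the two-switch route wins despite covering the same total distance -- exactly the tradeoff described above.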

General Formulation

Following from the above, we can formulate a general family of interconnection networks which looks most promising. This family has multiple levels of interconnect like the prototype. Unlike the prototype, it may have more than one level of non-local interconnect. Several viable interconnect topologies exist for each of the levels of interconnect and their composition.

Local interconnect

At the base of the interconnect hierarchy, there are local connections among array elements. These are generally worth keeping in any scheme: most connections in typical logic networks are between nearby elements, and short runs with little or no switching are the cheapest and fastest connections available.

How far local interconnect spans is, of course, highly dependent on the technology costs ( e.g. how expensive is switching delay versus propagation delay? how much space do configuration bits require? how big are wires relative to fixed logic and memory?) and the detailed choice of a local interconnect scheme.

Some viable local interconnect schemes include:

  1. Nearest neighbor -- As shown above, local interconnect may be limited to the adjacent NEWS neighbors. In some variants diagonal neighbors may be included, as well.
  2. Regular neighborhood -- This is a generalization of nearest neighbor. An array element's outputs are arranged to span more than one array element in each direction (they need not even span a uniform number of array elements in each direction). Figure shows an example of the regular neighborhood interconnection topology.
  3. Bounded subarray -- The prototype was an example of this genre. Many topologies which do not make sense when scaled up to large arrays are perfectly viable, even preferable, for small arrays. The goal here is to use these topologies in the domain where they do make sense. The prototype ran all array element outputs across the subarray row and column and allowed each array element to take as input anything in the same row or column.
  4. Crossbar -- This is really a special case of the bounded subarray. For a sufficiently small collection of array elements, it may make sense to fully connect them using a crossbar (perhaps modifying it slightly from a full crossbar to avoid some of the obvious, equivalent configurations identified above). Altera's Flex 8000 line, for example, provides full crossbar interconnect for ``logic array blocks'' of eight array elements [Alt93].

Non-local interconnect

Beyond the local interconnect, we need to wire up logical elements which must be placed further apart and provide for larger fanout. This is best done by integrating additional levels of network on top of the local interconnect.

The first issue to address is interfacing between the local and non-local interconnect. To receive non-local signals, the local network usually allocates some input channels to routed, non-local signals. For instance, a local crossbar scheme might increase the number of inputs on the crossbar to receive several non-local signals in addition to the local signals. In the prototype, the local multiplexors which delivered inputs to each LUT tapped off of both local and non-local tracks which span each column. In general, the non-local interconnect would arrange to switch a few non-local signals into the local network input and the local network would take care of routing these signals to the array elements themselves.

Conversely, we need to get the array element outputs into the non-local network. Typically, this is done by selecting an interesting subset of the local signals to concentrate into the non-local network and is often integrated with the first-level of the non-local interconnect. In the prototype, all the local outputs from a subarray were brought into a crossbar which then selected a subset and switched them into the non-local inputs to the adjacent subarray.

The topology for the non-local network is likely to draw upon many of the basic topologies we've already introduced. At the same time, it is likely to be organized into a hierarchical structure where a series of non-local networks are layered on top of each other, each spanning a successively larger portion of the array.

A given level of the non-local interconnect, then, might be:

  1. Nearest neighbor -- We saw that nearest neighbor interconnect was used for the non-local network in the prototype. We also saw that switching delay generally makes this, alone, suboptimal for large arrays. There are, however, hierarchical schemes which get around many of the problems of nearest neighbor interconnect. Typically, a non-local, nearest neighbor network would concentrate some subset of the local (or lower-level) inputs into the nearest neighbor switching network, allow adjacent connections, and allow connections back to lower levels of the network (or into the local network). For example, each small block of elements or nodes ( e.g. a 2x2 group) might make up one node in the next level of the network. This organization can be continued until a level is reached where the whole array of elements is one node. This forms a tree structured interconnect. The number of switching nodes which must be traversed between distant nodes in this network is now O(log N) rather than O(sqrt(N)).

    Of course, the bandwidth handled in the successive stages of the network must be carefully chosen. If the bandwidth is chosen too high, the higher level network will occupy more space than can be fruitfully employed. If chosen too low, it won't be possible to route all the desired, long-distance connections using the hierarchical interconnect.

    Since they are layered on top of each other in 2-space on the IC die, the bisection bandwidth from all levels composes to determine the bisection bandwidth of the device. If a geometric scheme is chosen where each level in the hierarchy supports a bisection bandwidth which is a fraction of the total bisection bandwidth of its constituency, the total bisection bandwidth is only a constant multiple of the base bisection bandwidth.

    e.g. Consider the nearest-neighbor mesh with 4 connections in each compass direction. Now consider that each 2x2 group of neighbors is grouped into a level-1 node, each 2x2 group of level-1 nodes is grouped into a level-2 node, and so on. Further, assume the aggregate level-n bandwidth between nearest neighbor nodes is one-half the total bandwidth of the level below. So, the level-1 nodes have 4 nearest neighbor connections in each direction, as do the level-2 nodes. Viewing this in an amortized sense at each array element, the total bandwidth in each compass direction is 4 + 2 + 1 + 1/2 + ... or, 8, in the limit of infinitely many levels of interconnect.

    In fact, in general, if the bandwidth between nodes at level n in the hierarchy is one F-th the total bandwidth of the contained level n-1 interconnect between nodes, the total bisection bandwidth will be F/(F-1) times the base bisection bandwidth. Note that this result is independent of the size of the array, so this topology scales nicely to larger arrays.

    At upper levels of the network, the nearest neighbor network should provide corner turns and selective spanning, neither of which were necessary for the local nearest-neighbor connections.

    While not identical in formulation or breadth, this topology captures the essence of mesh with bypass interconnect, segmented channel routing, and express cubes [Dal91].

  2. Regular Neighborhood -- Since this is a generalization of nearest neighbor, the same observations apply here. Higher levels of the network can encompass successively larger neighborhoods. All the issues of hierarchy, bandwidth, and function raised for nearest neighbor interconnect apply here, as well.

  3. Bounded subarray -- Here, too, we can repeat the subarray strategy at successive levels in the hierarchy. Each successive level can span a geometrically larger subarray. Using the subarray strategy from the prototype, connections are routed across each row and column in the subarray node. The propagation delay will grow with the subarray size due to physical distance, but the fanout loading need not -- that is, each subarray can service an equal number of lower-level subarrays, each of which presents a single load to the higher-level array. As with the other schemes exploiting hierarchical interconnect, the switching delay need only be O(log N). The subarray will also want to support corner turns and may want to support nearest neighbor subarray connections, as well.

  4. Crossbar -- A full crossbar can be viably employed in a hierarchical, subarray scenario. As before, the trick is to limit the size of the crossbar to a region where its spatial and temporal requirements are favorable. A hierarchical subarray of crossbars is essentially a fat tree [GL85] [Lei85] (See Figure ). See also [AL94] for some development of crossbar-based, hierarchical FPGA interconnect.

  5. Fat-pyramid -- The pure tree structure sometimes provides a poor match to physical locality (strict subarrays don't have nearest neighbor connections which span the edges of the subarray). Fat pyramids combine the nearest neighbor connections of the mesh with the more crossbar-like interconnect which gives us the fat-tree topology [Gre90].
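The geometric bandwidth-allocation argument above (each level carrying 1/F of the bandwidth of the level below) is a straightforward series to verify; this sketch simply sums it numerically.

```python
def total_bisection(base, f, levels):
    """Amortized per-element bisection bandwidth when each hierarchy
    level carries 1/f of the bandwidth of the level below it."""
    return sum(base / f**i for i in range(levels + 1))

# f = 2, base 4: the series 4 + 2 + 1 + 1/2 + ... approaches 8,
# matching the closed form base * f / (f - 1).
print(total_bisection(4, 2, 30))
print(4 * 2 / (2 - 1))  # -> 8.0
```

The sum converges to a constant multiple of the base bandwidth regardless of array size, which is why the scheme scales.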

For simplicity, I have described composite networks which are homogeneous in both organization and style. However, that need not be the case. The node size or bandwidth may want to change between levels within the same style, and it may be beneficial to mix styles. For example, local, crossbar subarrays may be interconnected by regular neighborhoods of interconnect which are in turn connected by nearest neighbor connections.

Pipelined Interconnect

Since switching delays in the interconnect can be comparable to logic delays, it will often make sense to pipeline the transmission of data through long interconnect. In particular, it may be useful to use the array element logic plus local interconnect delay as a base cycle time. When connections need to traverse additional, hierarchical interconnect, this will entail larger delay. By pipelining across this interconnect, we can operate at the base cycle time despite the larger latency in the interconnect. Here it will make sense to group together a number of switching stages whose delay is comparable to the array element logic plus local interconnect delay and treat each such group as a separate pipeline stage. Programmable or mandatory flip-flops associated with the switching network will be needed to support the pipelined interconnect discipline.
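Under the base-cycle discipline described above, the number of register stages a route needs follows directly from its total delay; the figures here are hypothetical.

```python
import math

def pipeline_stages(route_delay, base_cycle):
    """Stages needed so each group of switches fits within the base cycle
    (array element logic plus local interconnect delay)."""
    return math.ceil(route_delay / base_cycle)

# A route through hierarchical interconnect taking 3.4 base cycles:
print(pipeline_stages(3.4, 1.0))  # -> 4
```

The route then delivers its result 4 cycles later, but a new datum can enter the path every base cycle.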

When we include the multiple context configurations which switch the interconnect, we can also time-multiplex each piece of interconnect. This works especially well with the pipelined interconnect. Cycled at the base cycle time, the interconnect can pipeline multiple bits of data, potentially destined for different destinations, across any individual wire during the switching latency required to get a bit between distant portions of the device. This time-multiplexed wire use allows one to extract the full bandwidth of the wires on the programmable device, allowing much higher utilization than when a single signal is statically associated with each wire in the programmable interconnect. Note that this exploits the same basic characteristics as Virtual Wires [BTA93], but inside the programmable device.

External I/O

Since the delay, density, and area constants associated with off-chip interconnect are substantially different from those of on-chip interconnect ( i.e. off-chip interconnect is much slower, less dense, and larger), it often makes sense to handle device i/o specially in the network. To avoid periphery pin limitations, it may make sense to drive inputs directly into higher level, long-spanning interconnect in the array rather than simply into peripheral, nearest neighbor connections. Similarly, output connections may want to come from higher, coarser levels in the hierarchical interconnect rather than from local interconnect. Further, techniques like time-multiplexing the i/o pins to maximize their bandwidth ( e.g. Virtual Wires) are generally beneficial.

See Also...


AL94
Aditya A. Agarwal and David Lewis. Routing Architectures for Hierarchical Field Programmable Gate Arrays. In Second International ACM/SIGDA Workshop on Field-Programmable Gate Arrays. ACM, February 1994. Proceedings not available outside the workshop; contact the authors.

Alt93
Altera Corporation, 2619 Orchard Parkway, San Jose, CA 95314-2020. Data Book, August 1993.

BCE+94
Jeremy Brown, Derrick Chen, Ian Eslick, Edward Tau, and Andre DeHon. A 1µm CMOS Dynamically Programmable Gate Array. Transit Note 112, MIT Artificial Intelligence Laboratory, November 1994. [tn112 HTML link] [tn112 PS link].

BDK93
Michael Bolotski, Andre DeHon, and Thomas F. Knight Jr. Unifying FPGAs and SIMD Arrays. Transit Note 95, MIT Artificial Intelligence Laboratory, September 1993. [tn95 HTML link] [tn95 PS link].

BTA93
Jonathan Babb, Russell Tessier, and Anant Agarwal. Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators. In First International ACM/SIGDA Workshop on Field-Programmable Gate Arrays, 1993.

Dal91
William J. Dally. Express Cubes: Improving the Performance of k-ary n-cube Interconnection Networks. IEEE Transactions on Computers, 40(9):1016-1023, September 1991.

GL85
Ronald I. Greenberg and Charles E. Leiserson. Randomized Routing on Fat-Trees. In IEEE 26th Annual Symposium on the Foundations of Computer Science. IEEE, November 1985.

Gre90
Ronald I. Greenberg. The Fat-Pyramid: A Robust Network for Parallel Computation. In William J. Dally, editor, Advanced Research in VLSI: Proceedings of the Sixth MIT Conference, pages 195-213, Cambridge, Massachusetts, 1990. MIT Press.

Lei85
Charles E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, C-34(10):892-901, October 1985.

TEC+95
Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and Andre DeHon. A First Generation DPGA Implementation. Transit Note 114, MIT Artificial Intelligence Laboratory, January 1995. [tn114 HTML link] [tn114 PS link].

Xil93
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Logic Data Book, 1993.

MIT Transit Project