Entropy, Counting, and Programmable Interconnect
Andre DeHon
Original Issue: September, 1995
Last Updated: Thu Nov 30 22:09:56 EST 1995
Conventional reconfigurable components have substantially more interconnect configuration bits than they strictly need. Using counting arguments we can establish loose bounds on the number of programmable bits actually required to describe an interconnect. We apply these bounds in crude form to some existing devices, demonstrating the large redundancy in their programmable bit streams. In this process we review and demonstrate basic counting techniques for identifying the information required to specify an interconnect. We examine several common interconnect building blocks and look at how efficiently they use the information present in their programming bits. We also discuss the impact of this redundancy on important device aspects such as area, routing, and reconfiguration time.
Despite the fact that programmable devices (e.g. FPGAs) are marketed by the number of gates they provide, a device's interconnect characteristics are the most important factors determining the size of the programmable device and its usability. When designing the interconnect for a programmable device, we must simultaneously address several important, and often conflicting, requirements.
Fundamentally, interconnect description memory need not grow as quickly as wire and switch requirements. This leaves us with design points which are generally wire and switch limited. We may, consequently, encode our interconnect description sparsely when memory area is nearly free and wiring channels are determining die size. Since interconnect memory grows slowly compared to switch and wire requirements, interconnect memory need never dictate die size for large, single-context, reconfigurable components.
As we begin to make heavy use of the reconfigurable aspects of programmable devices, device reconfiguration time becomes an important factor determining the performance provided by the part. In these rapid reuse scenarios, interconnect encoding density can play a significant role in determining device area and performance.
This paper starts out by identifying a simple metric for characterizing interconnectivity -- a count of the number of ``useful'' and distinct interconnection patterns. While simple, this metric turns out to be difficult to calculate in the general case. We can, nonetheless, analyze a number of common structures to obtain bounds on the number of patterns provided by more complicated topologies. From this analysis we observe that conventional, programmable architectures have highly redundant programming bit streams. The analysis helps us identify opportunities to save programmable memory area and lower reconfiguration time by reducing that bit stream redundancy.
In the next section, we open with a motivational example to illustrate the issue and analysis. In Section , we define the interconnection metric more precisely. In Section , we look at some basic interconnection building blocks which can be easily analyzed in isolation. Section  looks at some ideal or pedagogical network structures by composing the primitives from Section . Section  touches briefly on the role of placement in interconnect flexibility. Section  looks at conventional interconnect examples, using the results of Section  to estimate the amount of redundancy these devices exhibit. We discuss the implications and impact of this redundancy in more detail in Section  before concluding in Section .
Consider a case where we wish to drive any of $i$ sources onto each of $o$ sinks. In our DPGA prototype [TEC+95a], for example, we needed to drive the 4 inputs to the 4-LUT from the 15 lines which physically converged upon the LUT cell ($i=15$, $o=4$; see Figure ). This kind of structure is typical when connecting logic block inputs to a routing channel.
We could provide full connectivity by building a full crossbar between the $i$ sources and $o$ sinks. This requires $i\times o$ crosspoints, 60 in this case. If each crosspoint is built with its own memory cell, as shown in Figure , this arrangement entails equally many memory cells.
At this point it is worth noting that there are $2^{60}$ ways to program 60 memory cells, but many of the resulting connection patterns are not useful. In fact, of the 15 memory cells along each output line (each line feeding one LUT input), at most one should enable its crosspoint at any point in time. Since each output should be driven by exactly one source, there are, in fact, only $i^{o} = 15^4$, or 50,625 in this case, combinations which may be useful. We can, in fact, control the $i\times o = 60$ crosspoints with only $o\cdot\lceil\log_2(i)\rceil$ memory cells, or 16 in this case. Here, each group of $\lceil\log_2(i)\rceil = 4$ bits serves to select which of the 15 inputs should drive onto each of the 4 output lines (see Figure ).
If, as is the case for the DPGA, the logic block is a $k$-input lookup table, even this arrangement provides more routing flexibility than is necessary. Since the LUT can realize any function of its inputs, the order in which the selected sources arrive at the LUT inputs does not matter, and there is, for instance, no need for any of the $k$ inputs to be the same. Strictly speaking, there are at most $\binom{i}{k}=\binom{15}{4}$ (in this example, 1365) distinct selections of the $k=4$ inputs from the $i=15$ sources. This implies we could get away with as few as 11 ($\lceil\log_2(1365)\rceil = 11$) control bits if we included heavy decoding.
From this calculation we can also conclude that any combinations provided beyond the 1365 distinct interconnect patterns entailed by the $\binom{15}{4}$ selections are redundant and offer us no additional functionality. This bound is often useful in understanding the gross characteristics of an interconnect. All interconnects which provide all of these combinations are functionally equivalent. Above this point, we can compare the number of specification bits to the number of requisite specification bits to understand the redundancy in the encoding scheme. For interconnects providing fewer distinct interconnect patterns, we can compare the number of achievable interconnect patterns to the maximum number of distinct interconnect patterns to get a scalar metric of the flexibility of the interconnect.
In practice, the DPGA prototype used four 8-input muxes. This gave it only 32 crosspoints, requiring $4\times 3 = 12$ bits to specify the inputs. Due to the limited size of the input muxes, it only supported 1217 distinct input combinations. Our encoding is within one bit of the theoretical minimum of 11 and routes 89% of the 1365 combinations identified above. An interesting, pragmatic variant providing full interconnect is detailed in the Appendix.
With respect to memory bits, we see that dense encoding of the combinations yields a more significant reduction in requisite memory cells than depopulating the crossbar in a memory-cell-per-crosspoint scenario. In an unencoded case, 12 memory bits would not even support a single connection from each of the 15 potential inputs.
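The arithmetic above is easy to reproduce; the Python sketch below (an illustration, not part of the original design) computes the subset-selection lower bound and the coverage achieved by the prototype's four 8-input muxes. The 1217 figure is taken from the text rather than rederived, since it depends on exactly which sources feed each mux:

import math

i, k = 15, 4                                 # 15 sources, 4-LUT inputs
distinct = math.comb(i, k)                   # 1365 distinct input subsets
min_bits = math.ceil(math.log2(distinct))    # 11-bit lower bound

mux_bits = k * math.ceil(math.log2(8))       # 4 x 3 = 12 bits for four 8-input muxes
covered = 1217                               # routable subsets in the prototype (from the text)

print(distinct, min_bits, mux_bits)          # 1365 11 12
print(round(100 * covered / distinct))       # 89 (percent of distinct subsets)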
Of course, minimizing the number of memory bits does not, necessarily, produce the smallest layout since dense encodings must be decoded and routed. We return to this issue in Section .
This example underscores several issues which are worthwhile to understand when designing programmable interconnect.
The interconnection metric we focus on here is a count: the number of unique and useful connection patterns which the network can realize.
With sparse encoding, many bit combinations may be parasitic or non-useful. In the ``bit-per-crosspoint'' example above, we saw that it was not useful to have more than one crosspoint enabled along an output line. In these cases, a large portion of the entire bit stream encoding space refers to interconnection patterns which are at best nonsensical and may even be damaging to the part.
An interconnection pattern which provides the same functionality as another specifiable pattern adds no capability to the interconnect. That is, the flexibility which allows both specifications provides no additional interconnect power. In the example above, we saw that a full crossbar provided $15^4$ (50,625) specifiable connection patterns. However, with the connections feeding into a 4-LUT, a pattern which brings inputs 15, 13, 10, and 7 into the LUT is no different from a pattern which brings inputs 10, 15, 7, and 13 into the LUT. The full crossbar thus provides $15^4 - \binom{15}{4}$, or 49,260, redundant patterns which add no functionality to the interconnect.
It is worthwhile to note that interconnect ``flexibility'' as used here can be applied to any interconnection network or family of graphs. As such it is very different from the interconnect flexibility defined in [RB91] [BFRV92] which is used to describe the level of population of switches in a particular interconnect family.
It is important that we look at the interconnection patterns as a whole to understand which patterns are functionally identical. Looking at the $k$-LUT input selector in isolation from the logic block, for instance, we would count more distinct patterns than actually exist when the selector and the LUT are viewed as an ensemble.
It is also significant that the metric looks at the ensemble of feasible interconnections rather than at the interconnect flexibility afforded to a single input in isolation. Since resources are typically shared, the focus on patterns accounts for the limited availability of shared resources. For example, consider a shared, non-local interconnection line as shown in the interconnect fragment depicted in Figure . Looking at a single input, we see that the input can come from any of the local inputs or any of the non-local inputs. Since the shared line is made available to all LUT inputs, any input could use this line to access the non-local inputs. However, since there is only a single non-local input line, it is not possible to bring more than one non-local line into the LUT at any point in time.
In this section we identify a few basic primitives used in building interconnection networks. We count the number of interconnection patterns they provide, as well as relating the abstract primitives to their more physical implementations.
An $n$-input multiplexor (mux) can connect any of its $n$ inputs to its single output. In isolation, the multiplexor realizes $n$ distinct interconnect patterns, requiring $\lceil\log_2(n)\rceil$ bits to specify its behavior.
Figure shows a multiplexor and several possible
implementations. Note that a series of tristate drivers attached to a
single, physical wire serves logically as a multiplexor and can be treated
as one for the purposes of analysis. This remains true whether the
tristate drivers have separate or central control. The series of tristate
drivers with distributed control has the additional ability to drive no
value onto the line, but this offers no additional logical functionality
since we can just as easily drive any value onto the output in such a case.
The tristate drivers could also have multiple values driven simultaneously
onto the output, but, unless this is used to perform a logical function,
the multiple drive cases are parasitic rather than useful behavior.
A crossbar has a set of $m$ inputs and a set of $n$ outputs and can connect any of the inputs to any of the outputs, with the restriction that only one input can be connected to each output at any point in time. Logically, an $m\times n$ crossbar is equivalent to $n$ separate $m$-input multiplexors. The crossbar can realize $m^{n}$ distinct interconnect patterns and requires $n\cdot\lceil\log_2(m)\rceil$ bits to specify its behavior. The crossbar represents the most general kind of interconnect and can often be used to calculate an upper bound on the flexibility which could be offered by a piece of interconnect.
A small crossbar is shown in Figure . We have already seen some potential implementations in Figures  and .
With subset selection, we select a group of $k$ outputs from $n$ inputs. We saw this selection when routing the 15 candidate inputs to a 4-LUT. This kind of selection is also typical when concentrating outputs from one region down to a limited size channel connecting to another region of the component. As we saw in Section , this function is not as trivially implemented as the crossbar, but the function is frequently desired, making this primitive highly useful for analysis. Subset selection entails $\binom{n}{k}$ distinct interconnection patterns and requires $\lceil\log_2\binom{n}{k}\rceil$ bits to specify its behavior.
Individual pass gates are too primitive to generally be useful from an analysis standpoint. Alone, each pass gate has 2 interconnection states which can be specified with one bit. From an analysis standpoint it is generally more useful to group collections of pass gates together into a larger structure. As noted above, when the direction of signal flow is apparent, groups of pass gates used to selectively drive onto a single line effectively form a logical multiplexor.
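The counts for these primitives are simple enough to capture in a few helper functions. The Python sketch below (added for illustration) returns, for each primitive, the number of distinct patterns and the configuration bits required:

import math

def mux(n):
    # n-input multiplexor: n patterns, ceil(log2 n) bits
    return n, math.ceil(math.log2(n))

def crossbar(m, n):
    # m-input, n-output crossbar: m**n patterns, n*ceil(log2 m) bits
    return m ** n, n * math.ceil(math.log2(m))

def subset_select(n, k):
    # choose k of n inputs: C(n,k) patterns, ceil(log2 C(n,k)) bits
    c = math.comb(n, k)
    return c, math.ceil(math.log2(c))

print(mux(8))                # (8, 3)
print(crossbar(15, 4))       # (50625, 16)
print(subset_select(15, 4))  # (1365, 11)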
In this section we apply and compose the primitives of the previous section to examine a few families of netlists and networks of general interest.
Let us consider the family of netlists with $n$ logic blocks where each logic block has at most $k$ inputs. Further, the netlist has $i$ inputs and $o$ outputs. To get an upper bound, we assume all logic blocks may provide a distinct function. In general, each of the logic block inputs may want any of the $n$ logic block outputs or any of the $i$ inputs. Each output may come from any of the $n$ logic block outputs. There are $(n+i)^{n\cdot k}$ possible interconnect patterns for the $n\cdot k$ logic block inputs and $n^{o}$ interconnection patterns for the outputs. All together, there are $(n+i)^{n\cdot k}\cdot n^{o}$ interconnection patterns which the netlist may exhibit. Without exploiting placement (see Section ), if all interconnects are given equally long descriptions, it will require at least $\lceil n\cdot k\cdot\log_2(n+i) + o\cdot\log_2(n)\rceil$ bits to describe each interconnection pattern.
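A direct transcription of this bound (an illustrative Python sketch; n, k, i, o follow the definitions above, and the example figures are hypothetical) gives the minimum description length for a member of this netlist family:

import math

def netlist_config_bits(n, k, i, o):
    # ceil( n*k*log2(n+i) + o*log2(n) ) bits to name one pattern
    return math.ceil(n * k * math.log2(n + i) + o * math.log2(n))

# e.g. 1024 4-input blocks with 256 netlist inputs and outputs (hypothetical sizes)
print(netlist_config_bits(1024, 4, 256, 256))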
In a similar manner, we can look at a device or interconnect and bound the number of interconnection patterns it provides. Again, assuming we have $n$ logic blocks with $k$ inputs each, along with $i$ inputs and $o$ outputs, if we do not allow inputs to be directly connected to outputs, we have a total number of interconnection patterns of, at most:

$(n+i)^{n\cdot k}\cdot n^{o}$

If inputs can be directly connected to outputs, we have $(n+i)^{o}$ possible output patterns for a total number of interconnection patterns bounded by:

$(n+i)^{n\cdot k}\cdot(n+i)^{o} = (n+i)^{n\cdot k+o}$

For many devices, each inter-chip i/o pin can serve as either an output with enable or an input. We thus let $i = o = N_{io}$ and assume we may have to route both the output and the enable for each physical pin, giving a total of at most $(n+N_{io})^{n\cdot k + 2N_{io}}$ interconnection patterns. The interconnect can thus be specified with $\lceil(n\cdot k + 2N_{io})\cdot\log_2(n+N_{io})\rceil$ configuration bits.
From this and the previous example, it is worth observing that network configuration requirements grow as $O(n\log n)$ --- assuming $k$ is fixed and $N_{io}$ grows at most linearly in proportion to $n$. We can compare this to the configuration requirements for the logic, which grow as $O(n)$. We can also compare this with the number of crosspoints or total length of interconnect wire, which grows as $O(n^2)$ if we are going to provide the full interconnect to support any of the possible $n$-node netlists.
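These growth rates can be illustrated numerically. The sketch below (Python, illustrative only; i/o contributions are ignored for simplicity) tabulates interconnect description bits, logic configuration bits, and full-crossbar crosspoints as $n$ grows with $k=4$:

import math

k = 4
for n in (64, 256, 1024, 4096):
    net_bits = n * k * math.log2(n)   # interconnect description ~ O(n log n)
    logic_bits = n * 2 ** k           # 2**k function bits per k-LUT ~ O(n)
    crosspoints = n * (n * k)         # full crossbar from n outputs to n*k inputs ~ O(n^2)
    print(n, round(net_bits), logic_bits, crosspoints)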
These general observations, summarized in Table , are
worthwhile to remember. Asymptotically, they tell us wires and crosspoints
will be the pacing items for offering flexibility in programmable devices.
They also tell us to expect interconnect programming to grow faster than
logic cell programming. We can describe any interconnect much more
cheaply than we can spatially route it, underscoring the need to
reuse physical crosspoints and wires in time as we build larger
programmable devices and systems.
If a device's network is built entirely out of $k$-LUTs, we can get a slightly tighter bound on the number of interconnection patterns provided. Again, our source of inputs is the $n$ LUT outputs and the $i$ inputs. Each LUT must choose $k$ inputs from these $n+i$ sources. Together, this makes for $\binom{n+i}{k}^{n}$ total interconnection patterns. If we also assume $i = o = N_{io}$ and outputs can come directly from inputs, the total number of interconnection patterns is at most:

$\binom{n+N_{io}}{k}^{n}\cdot(n+N_{io})^{2N_{io}}$

The difference between this case and the previous is that the network, sans output connections, has at most $\binom{n+i}{k}^{n}$ patterns rather than $(n+i)^{n\cdot k}$.
The ratio between these two expressions is:

$\frac{(n+i)^{n\cdot k}}{\binom{n+i}{k}^{n}}$

For large $n$ and small $k$, $\binom{n+i}{k}\approx\frac{(n+i)^{k}}{k!}$. This ratio is then roughly $(k!)^{n}$. In terms of configuration bits, this amounts to $n\cdot\log_2(k!)$ --- a small savings linear in $n$ for fixed $k$. E.g., for $k=4$, one can save at most 4-5 network configuration bits per LUT by exploiting the input pin equivalences. Referring back to Table , this tells us that exploiting LUT input equivalences saves us $O(n)$ configuration bits, but does not, fundamentally, change the growth rate for interconnect configuration.
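The per-LUT savings from input interchangeability is just $\log_2(k!)$; a couple of lines of Python (illustrative) confirm the 4-5 bit figure quoted above:

import math

for k in (2, 3, 4, 5, 6):
    # bits saved per LUT by exploiting input permutation equivalence
    print(k, round(math.log2(math.factorial(k)), 2))
# k = 4 gives log2(24), roughly 4.58 bits saved per LUT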
Once we identify a level of desired interconnect flexibility, it is not necessary for the physical interconnection network to solely provide that flexibility. In programmable devices, we also have freedom in where we place functions and results within the interconnection network. The net result of this freedom is that we can achieve a given level of flexibility with a network which itself provides fewer distinct interconnection patterns. This also means we could, in theory at least, use fewer bits to specify the interconnect pattern than the base-two logarithm of the desired number of patterns, once we observe that the placement of functions and results relative to the network also gives us a degree of specification freedom.
To make this concrete, let us consider a very simple interconnection problem where we wish to interconnect 4 2-input LUTs ($n=4$, $k=2$). We know there are $\binom{4}{2}^{4} = 1296$ possible distinct interconnection patterns for this small example. First, we consider a limited interconnection network where we arrange the four LUTs into a $2\times 2$ array. Each LUT has one input which may come from either LUT in the first column and a second input which may come from either LUT in the second column (see Figure ). Each of the $4\times 2 = 8$ inputs in this restricted interconnect can come from one of two places, making for a total interconnect flexibility of $2^{8} = 256$. It is also worthwhile to note that all 256 interconnect combinations are distinct. If we had fixed the placement of the 4 logic functions in the array, then we could only realize these 256 interconnect patterns.
Allowing the functions to be arranged within the array allows greater flexibility. We know there are $4! = 24$ ways to place 4 logic functions in the $2\times 2$ array. Because of the high symmetry of the physical network, it turns out that groups of 8 permutations are equivalent with respect to routing. As a result there are 3 distinguishable placement classes. This could provide us with at most $3\times 256 = 768$ interconnection patterns if all of the permuted interconnects were distinct. In practice, this gives us 720 of the 1296 possible patterns. Using an alternate interconnection scheme with less symmetry, shown in Figure , we can get 6 distinguishable placement classes and achieve 1104 of the 1296 possible patterns. The 1104 interconnection patterns are, of course, over a factor of four more than the 8 bits of interconnect programming could specify alone.
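The arithmetic in this example is easy to check. The short sketch below (Python, illustrative; the placement-class counts and coverage figures are taken from the text as given) reproduces the headline numbers:

import math

total = math.comb(4, 2) ** 4        # 1296 distinct patterns for 4 2-LUTs
fixed = 2 ** 8                      # 256 patterns with placement fixed
placements = math.factorial(4)      # 24 ways to place the 4 functions

classes_sym, reached_sym = 3, 720       # symmetric 2x2 network (from the text)
classes_asym, reached_asym = 6, 1104    # less symmetric variant (from the text)

print(total, fixed, placements)                      # 1296 256 24
print(classes_sym * fixed, reached_sym / total)      # 768 upper bound, ~0.56 coverage
print(reached_asym / total, reached_asym / fixed)    # ~0.85 coverage, >4x the 256 patterns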
This example underscores the fact that placement in asymmetric networks allows us to expand the number of realizable networks for a given, limited, physical interconnect. Additionally, we see that successfully routing networks in this scheme requires picking the correct permutation for the network (placement) and then the correct interconnection pattern (routing). Asymmetry in the network can have the effect of both increasing the flexibility, by making more interconnection patterns realizable, and of making the routing problem harder, by offering a larger space of distinct placement classes to explore during routing. In general, the extent to which placement can reduce the demand for interconnect resources, including wires, switches, and configuration bits, remains an open issue.
Table  summarizes the major characteristics for several contemporary programmable devices. Additionally, a pure 4-LUT design is included for the sake of comparison with the industrial offerings. We can make a rough, back-of-the-envelope-style computation of the bits required to configure the network by making the assumption that the network does support all potential networks without exploiting placement flexibility. That is, we assume that the basic logic blocks are fully connected. Since the conventional offerings are far from being fully interconnected, this gives us an upper bound on the number of network configuration bits. Adapting Equation  and taking the base two logarithm to convert to bits, we get:

$(n\cdot k + 2N_{io})\cdot\log_2(n+N_{io})$
We can also calculate the number of bits required to specify the logic block functions in the obvious manner; for $n$ $k$-input lookup tables this is simply $n\cdot 2^{k}$ bits.
Table summarizes the results of these basic
calculations for the identified components.
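The same back-of-the-envelope computation can be reproduced for any device whose block and pin counts are known. The sketch below (Python) applies the two formulas to a purely hypothetical 1,024-block, 4-input, 192-pin device; the parameters are placeholders, not figures from the table:

import math

def upper_bound_net_bits(n, k, n_io):
    # (n*k + 2*n_io) * log2(n + n_io): placement-independent upper bound
    return math.ceil((n * k + 2 * n_io) * math.log2(n + n_io))

def lut_logic_bits(n, k):
    # 2**k function bits per k-LUT
    return n * 2 ** k

n, k, n_io = 1024, 4, 192           # hypothetical device parameters
print(upper_bound_net_bits(n, k, n_io))
print(lut_logic_bits(n, k))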
The comparison is necessarily crude since vendors do not provide detailed
information on their configuration streams. However, we expect the
unaccounted control bits in Table to not be more than
10% of the total device programming bits. With this expectation, we see
that these devices exhibit a factor of two to three more interconnect
configuration bits than would be required to provide full,
placement-independent, logic block and i/o interconnect.
We can, of course, derive a tighter bound for the reference 4-LUT design by adapting Equation :

$n\cdot\log_2\binom{n+N_{io}}{k} + 2N_{io}\cdot\log_2(n+N_{io})$

For the 1,024 4-LUT case identified above, this saves roughly 5 bits per LUT, as suggested in Section .
Eight of the 13 inputs on the XC4K go into 4-LUTs and all 16 of the XC5K
inputs go into 4-LUTs. Consequently, all of these arrays require
comparably fewer network configuration bits. The 4 inputs on the Altera 8K
LE also go into a 4-LUT. Since one input may also be used in a control
capacity, the reduction is slightly lower for the Altera 8K part.
As we see in Table , we ultimately expect devices to be wire
or switch limited. In the wire limited case, we may have free area under
long routing channels for memory cells. In fact, dense encoding of the
configuration space has the negative effect that control signals must be
routed from the configuration memory cells to the switching points. The
closer we try to squeeze the bit stream encoding to its minimum, the less
locality we have available between configuration bits and controlled
switches. In Figure
, for instance, we had to run
additional control lines into the crossbar to control the crosspoints.
These control lines compete with network wiring, exacerbating the routing
problems on a wire dominated layout.
In a multiple context or switched interconnect case, the effects of memory density are more pronounced. Limited wire resources are reused in time, making more efficient use of the wires and minimizing the effects of bisection wiring limitations. In these cases the chip needs to hold many configurations worth of memory simultaneously. If one is not careful about the density of the interconnect configuration encoding, the configuration memory stores can dominate chip area.
In the aforementioned DPGA Prototype [TEC +95a], for example, even with four on-chip contexts, wiring and switching accounted for over half of the die area. Network configuration memory made up about one fourth of the area. A factor of 2-4 increase in the configuration memory due to sparser encoding would have forced a substantial (20-60%) increase in die area.
Of course, the real problem associated with reconfiguration time is the i/o bandwidth limitation. It is certainly not necessary for the bits stored in the configuration memories to be identical to the off-chip interconnect specification or the specification transmitted across the chip boundary. In wire limited cases, where local memory cells are inexpensive or free and routing control signals are expensive, the device could decompress the configuration specification during configuration reload time.
For example, let us revisit the LUT interconnection example from Section  in the wire limited case. Here, we suppose the technology costs dictate that the memory-cell-per-crosspoint design is more area efficient than the denser encoding schemes identified in that section. We could build a single copy of the decoder for the 11-bit encoding scheme. This single decoder could then be used to decode each 11 bits of LUT input configuration into the 60 bits required to program the input crossbar. As a result, we reduce the LUT input portion of the bitstream by a factor of $\frac{60}{11}\approx 5.5$ over the unencoded case.
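One way to picture such a decoder is as a map from a dense index to the one-hot crosspoint bits. The Python sketch below (an illustration only, not the prototype's circuit) expands an 11-bit code into one of the $\binom{15}{4}$ input subsets and then into the 60 crosspoint-enable bits:

from itertools import combinations

SOURCES, INPUTS = 15, 4
SUBSETS = list(combinations(range(SOURCES), INPUTS))   # 1365 distinct selections

def decode(code):
    # expand an 11-bit code (0..1364) into 60 crosspoint-enable bits
    chosen = SUBSETS[code]                      # which 4 of the 15 sources to use
    bits = [0] * (SOURCES * INPUTS)             # one bit per crosspoint
    for lut_input, source in enumerate(chosen):
        bits[lut_input * SOURCES + source] = 1  # enable the source -> lut_input crosspoint
    return bits

config = decode(1000)
print(sum(config), len(config))                 # 4 enabled crosspoints out of 60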
It is worthwhile to note at this point that the compression we are discussing here is design independent. That is, the bounds on configuration size we have derived throughout this paper are applicable across all possible interconnection patterns which a network may provide. It is also possible to exploit redundancy within the design to further compress the configuration bit stream in a design dependent manner.
Sparse interconnect configuration encodings can result in bloated bit stream configurations. Asymptotic growth rates suggest that dense encodings grow more slowly than desired interconnect requirements, placing designs in a wire and switch limited domain. Consequently, some encoding density may be judiciously sacrificed at the on-chip storage level to decrease control interconnect in programmable devices. Multiple-context components, on the other hand, have a greater demand for on-chip storage and merit denser encoding in order to make effective use of silicon area.
Sparsely encoded bit streams have their largest impact on configuration load time. As device size increases and the reconfigurable aspects of programmable devices are heavily exploited, the i/o bandwidth limited reconfiguration time becomes a significant performance factor. Dense configuration encoding can be exploited to minimize the external storage space required for configurations and the transmission time required to communicate configurations.
At a broader level, we have focused on interconnect flexibility to establish gross bounds on the information required to configure a device. We demonstrated simple building blocks and metrics for gauging the flexibility of an interconnect and for appraising the level of redundancy in a particular interconnect description. These tools can be useful for first order analysis of programmable interconnect designs.
Dense encodings can make a single switch depend on many configuration bits. In Section  we pointed out that decoding time affects performance in multicontext designs. In the DPGA LUT input case introduced in Section , full encoding down to 11 bits requires that we look at all 11 bits to set even one of the input sources, making for moderately slow decoding.
To keep decoding time low, we might impose the additional constraint that each input have an independent set of bits controlling its source selection. In the full crossbar input case, for example, we had 4 bits selecting each input source. The question, then, is: can we exploit the input equivalence classes imposed by the $k$-LUT to achieve a tighter result while retaining independent input decoding?
We can, in fact, achieve a better encoding within these constraints. For any $n$ choose $k$ case, we can build a structure with only $(n-k+1)\cdot k$ crosspoints, requiring $k\cdot\lceil\log_2(n-k+1)\rceil$ bits. In the case at hand ($n=15$, $k=4$), that means we need only 48 crosspoints while still requiring 16 control bits. Alternately, it means we can support as many as 19 input sources with the same 16 control bits and only 64 crosspoints. To achieve this bound, we organize the inputs as follows: input $j$ (for $j = 0\ldots k-1$) may select any of the $n-k+1$ consecutive sources starting at source $j+1$. That is, the first input gets sources $1\ldots(n-k+1)$, the next $2\ldots(n-k+2)$, and so on. The final input gets sources $k\ldots n$.
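The claim is easy to verify exhaustively for the concrete case before giving the argument. The Python sketch below (illustrative, not from the original note) builds the banded source sets and checks that every four-source subset can be routed by assigning the sorted sources to the inputs in order:

from itertools import combinations

n, k = 15, 4
# input j may select sources j+1 .. j+(n-k+1)  (sources numbered 1..n)
choices = [set(range(j + 1, j + (n - k + 1) + 1)) for j in range(k)]

def routable(subset):
    # assign the sorted sources to inputs 0..k-1 in order
    return all(s in choices[j] for j, s in enumerate(sorted(subset)))

assert all(routable(s) for s in combinations(range(1, n + 1), k))
print(sum(len(c) for c in choices))   # 48 crosspoints in total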
Proof: Any set of $k=4$ selected sources must contain one of the sources assigned to input 0 (sources $1\ldots 12$). There are only 3 other sources (13, 14, and 15), so if we need four distinct sources, at least one of them will come from input 0's set. Now, we have two cases:
Given the constraints of separate input selection, it is not possible to get away with any fewer crosspoints and still retain full connectivity. Proof: Each of the $k$ LUT inputs must be able to select at least one of the needed input values. If any of the inputs can select from fewer than $n-k+1$ values, then one could pick a set of $k$ values, none of which were in said input's set of choices. Since this input cannot provide any of the selected values, the selected input pattern cannot be routed. Therefore, the $k$ inputs would not be able to select all subsets of size $k$, as desired.
This arrangement is optimal since it is both complete and minimal.