Guaranteeing Idempotence for Tightly-Coupled, Fault-Tolerant Networks
Ian Eslick
Andre DeHon Thomas F. Knight Jr
Original Issue: May, 1994
Last Updated: Tue Jul 5 19:27:21 EDT 1994

This paper presents techniques for guaranteeing idempotent semantics for classes of network operations in fault-tolerant networks. This is accomplished through operation transformation and message filtering. The conventional response to a failed or corrupted message transmission is to retransmit the failed message. This retry mechanism can lead to duplicate message delivery. If duplicated messages are non-idempotent, duplicate message delivery places the system into an inconsistent state. We explore methods of guaranteeing message idempotence at both the application and network protocol levels.
In large multiprocessing systems, tolerance to component and wiring failure is an important design consideration. Fault tolerance concerns play a major role in many aspects of a multiprocessor system and features prominently in interconnection network design. If the network is not engineered to tolerate component and link faults, the failure rate of the network increases in direct proportion to the number of components and links in the network. Consequently, as we scale our systems in size, the system's mean time to failure (MTTF) drops dramatically. For systems in non-ideal field conditions, environmental factors may further reduce the MTTF, for example dirt, vibrations and temperature cycling can lead to faulty connections and broken components.
Multiprocessor networks can tolerate static faults, however few provide protection against transient faults. We know that transient faults account for the dominant portion of faults leading to system failure [And85]. The lack of emphasis on transient faults may rise from a variety of causes. Hardware has proven to be sufficiently reliable that hardware based failures are rare. In fact, for most systems which run full-blown operating systems (usually a UNIX style derivative), system software causes the dominant fraction of system failures. While this trend may prove true today, future computing systems will undoubtably find benefit in fault-tolerant computing substrates.
The MIT Transit project has constructed the Multipath Enhanced Transit Router Organization ( METRO) [DCB +94], a routing switch architecture engineered to provide fault tolerance while maintaining high performance in short-haul networking applications. To support fault-tolerant network operation, METRO supports multipath, multistage topologies, stochastic path selection and source responsible, end-to-end messaging protocols. We have developed the Metro Endpoint Routing Protocol ( MRP_ENDPOINT) [DeH93] which provides at-least-once, uncorrupted message delivery in METRO networks.
The MRP_ENDPOINT protocol layer requires that all network messages be idempotent, so that multiple receipts of a message by the communications endpoint does not yield a semantically different result than the receipt of a single message. Or alternatively stated, the message operation and arguments on the new value yields the new value. That is:
With this guarantee, MRP_ENDPOINT will ensure correct operation in the presence of transient faults.
This paper presents techniques for analyzing and transforming common
classes of non-idempotent operations into idempotent operations to
guarantee exactly-once delivery semantics. The collection of
transformation techniques presented is relevant to any communications
system that may duplicate messages. In addition, we discuss general
mechanisms in the short-haul networking domain for constructing
endpoint protocols that guarantee idempotence to higher layers of the
network. In Section we discuss idempotence issues in the
context of the METRO architecture, and briefly extend our
discussion to include other classes of networks.
Section
introduces operator transforms and evaluates
the complexity and cost of implementation. Section
introduces a general mechanisms for ensuring idempotence. We explore
optimization and implementation details of these mechanisms in
Section
.
This section describes the essential fault-tolerant features of the METRO architecture and the MRP_ENDPOINT protocol. We will abstract the problem of guaranteeing idempotence by detailing assumptions possible in various short-haul network settings. We argue that the placement of functionality which ensures idempotent network operation must be balanced by between requirements of simplicity and efficiency. In all cases, the performance of low-level network capabilities should never be hidden from programs.
The METRO architecture is circuit switched, providing a channel through the network between source and destination. This channel may be quickly reversed at the end of message transmission to allow data to flow back to the source. MRP_ENDPOINT uses this reversed connection to support end-to-end checksum acknowledgments, allowing the source endpoint to quickly verify that the data has been received and checksummed correctly by the destination. The source endpoint employs a simple source responsible retry scheme causing it to retransmit the message under any of the following conditions:
The return acknowledgment can be corrupted by transient faults, causing it to deviate from protocol specifications, leading to message retransmission. If the acknowledgment indicates corrupted receipt of a transmission, then the retry mechanism will again retransmit a second message to the destination endpoint. Consequently, MRP_ENDPOINT expects higher level protocols to ensure that only one message will ultimately be acted on by the system.
To illustrate the problems caused by duplicate non-idempotent message reception, consider an operation that decrements a synchronization counter. For example, dataflow synchronization in Berkeley's TAM [CSS +91] model. Redundant application of synchronization messages causes the synchronization counter to expire early. This may lead to a thread being posted before the data is available for the thread to compute.
Providing idempotence through filtering duplicate messages is an old problem in distributed computing. The transmission Control Protocol (TCP) provides a standard solution to this problem. TCP supports ``reliable'' data streams by using sequence numbers [Pos81]. When a source needs to communicate with a destination, the source arbitrates with the destination for a valid set of sequence numbers. The source annotates each unique packet of data transmitted to the destination with a different sequence number. The destination node keeps track of all the sequence numbers it has seen so that exactly one copy of each packet arriving at the destination is passed along to higher-level protocols. In this manner, any duplicates arising due to source-responsible retransmission are filtered out, making each message effectively idempotent to protocols running on top of TCP.
While correct, this particular solution is not well-suited for achieving low-latency communications in a large-scale multiprocessor context. Arbitrating for sequence numbers, filtering and reordering packets are together an unnecessary cost that need not be paid in many tightly-coupled networking scenarios. The overhead of these activities in terms of space and processing time can easily prove many times greater than the time and space required for basic message transmission. In these cases, transmission times are often much less than endpoint message handling time, whereas in the distributed computing case TCP style protocols are usually a small fraction of the overall transmission latency. The critical optimization domain in the multiprocessing context is reducing message handling time. This inclines us to work carefully to adopt solutions that are as lightweight as possible.
Designers constructing network protocols in the multiprocessing setting have the advantage of a constrained domain. As such, they can specialize protocol features and services for their domain to reduce the overhead of making idempotent guarantees. In comparison TCP must service a more general domain. For various topologies, network architectures and protocols we can:
Sequential delivery is a policy decision of the network protocol. Each node can wait for confirmation from the destination before transmitting the next message to that node. The cost of making this decision may be small compared to local or wide area networks. For networks with small message sizes, such as the CM-5, this scheme will drastically reduce latency. For the MRP_ENDPOINT protocol, sequential delivery is both a convenient and cheap policy decision due to the underlying features of the METRO network.
A great deal of research is being performed to enforce locality of reference in multiprocessor settings with the aim of promoting high performance. This property by definition limits the practical size of a particular node's working set of communicating nodes.
The third item is an assumption that is often reasonable in practical settings, such as embedded systems or fixed size architectures. While we might want architectures which can scale to attain continual parallelism benefits, particular machines and specialized architectures may have a fixed size. This fixed size again bounds our working set and provides us with a concrete range of expected node ID's.
The ``end-to-end argument'', argued by Saltzer et al. [SRC84], suggests that all functionality in layered communications systems should be pushed out as close as possible to the application which uses that function. If a function will often be duplicated by an application, then low-level implementation becomes redundant and therefore costly. Any decision to include functionality at a lower network layer must be carefully considered, trading off efficiency and complexity.
We illustrate in the following section that fully guaranteeing idempotence at the application level through transformations can entail considerable overhead. Placing this function in the network endpoint protocols allows us to solve the problem once for all applications, providing a convenient abstraction and overhead savings. This incorporation is done without significantly slowing raw network interface performance. Applications which do not require this guarantee may disable it and avoid performance penalties resulting from the additional functionality.
A last concern is that adding additional mechanism to a critical layer of the system makes it more complex from an engineering standpoint. Our design needs to integrate cleanly with existing software mechanisms and have minimal interaction with other system components. We show that having this mechanism in the network endpoint protocols of our tightly-coupled multiprocessor networks satisfies the tradeoffs in the end-to-end arguments and can be done with minimal effort once the design issues are well understood.
If our network does not guarantee idempotence, or this guarantee constitutes unacceptable overhead, then we must assure that our network messages are idempotent at the application level. We can characterize many common operations in multiprocessor systems as falling into one of the following categories:
The following subsections look at each of these categories and see how each either satisfies the idempotence requirement, or may be transformed into an idempotent operation.
A basic and essential messages in any MIMD-style multiprocessor system support synchronization. This often takes the form of a fetch-and-op routine [GS90] [GGK +84]. The fetch-and-op routine proceeds as follows:
Operations (1) and (2) have to be executed together, atomically, to achieve the behavior desired from fetch-and-op routines. Fetch-and-ops can only be executed multiple times without causing semantically different behavior if the operation is idempotent.
One important class of operations where this relation does not hold is the decrement and increment operations often used in synchronization. As described earlier in the context of the TAM model, multiple decrements of a synchronization counter will cause the counter to decrement to zero before the program expects, resulting in incorrect behavior.
To avoid this difficulty, we can replace synchronization decrements
with bit set operations. Because bit set operations are idempotent, a
fetch-and-op based on a bit set is also idempotent. We make a
to
space tradeoff to gain correct
operation. For small N, this tradeoff won't be noticed as most
synchronization variables are usually allocated in multiples of bytes.
A common problem in systems where messages can be delayed arbitrarily is deciding when to deallocate space. We have to assure that no more messages for that variable are going to arrive and overwrite new variable data. The trivial solution is to ignore the problem, and not deal with the overhead of maintaining all synchronization variables. This has obvious disadvantages, especially if there is no bound on the number of such variables.
A more robust solution is to replace a simple bit-set write with a test-and-set such that each variable is coupled with a monotonically increasing sequence number which corresponds to the variables identity. Messages that have carried over from previous allocations of that variable can test the identity of the variable to make sure it can perform the bit-set.
The test-and-set operation is a subset of the fetch-and-op and is used to implement system objects such as mutexes or semaphores. The results of these style operations can lead to a different class of intolerable behaviors than described above. These behaviors primarily take the form of errors rising from negative return results.
For example, if a network operation is trying to get a lock on a remote mutex, the first message may succeed, but the acknowledgment might have been corrupted. On the source node, the network interface is retrying the message, while on the destination, a confirmation of the lock is queued up for transmission back to the source.
The retransmitted test-and-set operation tests the lock and finds that it is locked. The destination node queues a fail message (assuming a non-blocking mutex lock request). If no more failures occur, then the source sees a success message followed by a fail message. Again, depending on the implementation of the program requesting the lock, this could be non-idempotent and lead to program failure.
To illustrate the subtle nature of analyzing these possible interactions, we could alternatively have a test-and-set operation which results in a return only on success. A mutex polling operation might work like this. If success messages are idempotent to the requesting program then the entire exchange is idempotent, and the problem is solved.
Raw reads and writes are idempotent as multiple writes yield the same value. Multiple remote reads are followed by local writes of the returned read data. Few applications deal directly in such raw reads and writes, and not many systems support such direct access to the network facilities. The Cray T3D is one exception to this rule, and provides this raw read/write interface to the programmer.
More often, raw reads and writes are intimately connected with operations such as data transfer, remote function invocation and synchronization primitives.
Multiple writes also pose a problem if subsequent message copies cross synchronization boundaries and override writes from other nodes in the system which occurred between the first message receipt and subsequent copies.
As the multiple writes issue indicates, these problems are usually addressed by introducing synchronization mechanisms to ensure that reads and writes are properly sequentialized. However, this sequentialization relies on clients of the system obeying the synchronization contract. If the synchronization mechanisms are blocking, then we do not violate the idempotence requirement so long as we check the blocking queue against incoming lock requests to ignore duplicate requests. If the mechanisms are non-blocking, then we encounter the same concerns as mentioned above in the test-and-set case.
Remote function invocation is very similar to writes in that they perform side effects into the destination's memory, or into a network messages to be sent back to the source. Fetch-and-op is a special case of the remote function invocation but since it is a common operation with a well defined behavior, it warrants individual attention. The problems are similar, as multiple invocations of the remote function may result in successively different results on both the source and destination nodes.
We can break down the space of possible erroneous results arising from duplicated remote function calls into those arising from multiple non-idempotent side-effects into the remote memory, and those arising from multiple return results. Unfortunately, the possible interactions of these errors and various system components can become quite complex.
It should prove useful to the system or application designer working the short-haul networking setting to see the salient features of these interactions. The potential for multiple function invocation leads to a series of operations being performed on remote data. If the remote function does not side effect into the remote node's memory, we are safe. If it does write into the remote memory, then each operation of the function that side-effects into that memory must be idempotent. This can be assumed if we know that all the data values accessed by the function and the operations on that data will be invariant between duplicate calls, then the function reduces to the case of a simple, deterministic write, and is consequently idempotent.
For return results, the problems created are highly dependent on how these results are integrated into the calling node computation. For example, in the TAM model a return message from a split-phased remote read activates an inlet thread, a small function meant to incorporate the returned data. These inlets usually copy the data from the message into an inlet or frame slot and post a thread on the system queue which is intended to deal with this data. Multiple return results would lead to the same data in the inlet slots, but would cause multiple thread invocations to be queued up on the originating node.
Clearly, handling remote function call duplication problems on a case-by-case basis can be self defeating as the complexity may lead to more system level errors than the idempotent properties will prevent. Though solutions are possible, they are often subtle and difficult to reason about. This indicates that a more general and pervasive guarantee mechanism is needed to handle cases where the solution is not straightforward.
To avoid these difficulties, we propose maintaining sequence numbers as a function of nodes in the system rather than as a function of objects on a node. This reduces the per-node space cost of our idempotence guarantee. With the proper set of assumptions, we can also allow this mechanism to scale nicely with system size and program complexity. Additionally, by accepting limitations on the number of nodes in the system, we can make our mechanism exceptionally lightweight.
A problem that arises in systems which track sequence numbers is knowing when to throw a <node ID:sequence number> pair away, as there is no bound on how long a duplicate message can be blocked within the network. In the METRO architecture, the circuit switched property of the network guarantees that all messages between nodes occur sequentially. That is, any particular message will only arrive at the destination after the prior message has been successfully acknowledged. This is also true of deterministic routing protocols in mesh style networks. Because only one pair need be maintained per node, this constrains the node storage space to be equal to the number of nodes within a particular node's communication group.
Well balanced communication and computation in a large multiprocessor will usually result in any particular node communicating with a small number of neighboring nodes. This is motivated by the low cost of neighbor vs. remote communications in most practical, scalable topologies. To our sequence ID tracking scheme, this means we will only have to track a small collection of <node ID:sequence number> pairs at any particular time.
Alternatively we can place limitations on the number of processors in our system. This is often a limitation of architectures which aren't intended to scale indefinitely or in specialized applications such as an embedded vision system. This limitation allows us to make the same assumption as with referential locality, but we are able to set a hard limit on the number of nodes we have to track.
Assuming only the sequential delivery of messages we can implement a software tracking scheme which maintains <node ID:sequence number> pairs which serve as a unique, monotonically increasing sequence identifiers for each node. This protocol will work best when the set of communicating nodes are limited and static. The protocol executes as follows:
ignore. Otherwise message.seq_ID
stored.seq_ID and perform the message operations.
There are a few potential problems with this scheme that warrant a closer look. If the patterns of communication shift, then old stored node ID's will take up memory space, but no longer be essential to ensure correct operation. The problem of knowing when to drop a particular ID can be solved through a LRU policy, arbitrating with nodes that have been idle the longest, once memory constraints require freeing up memory.
A similar problem arises if particular node, such as an I/O node, receives a large number of messages from all over the machine, in this case we have no choice but to cache the ID/number pairs of every communicating machine. As this is likely to be a special case, we can handle this in a more costly manner, keeping the common case execution simple and cheap.
In the case where a node runs out of storage space for ID pairs, it can drop or block all incoming non-matching messages until it can arbitrate with stored nodes to free storage space. Since the congestion caused by backing the computation into the network will slow down the arbitration, we may want to add provisions for an exponential backoff scheme similar to that used in the Ethernet protocol to free routing resources.
While dropping messages works well for source-responsible protocols, it may be much more expensive in mesh style networks. While the solutions rest heavily on the specific features of the network and routing protocol, one possible method of addressing the problem in adaptive routing networks is to bypass messages which have been backed into the network through a virtual channel that gives high priority to space resolution messages.
If messages arrive at a node sequentially then there are only going to be two distinct types of network messages: duplicate and non-duplicate. Thus we need only cache a single bit for each node we are communicating with. If the bit is identical to the last message, it is a duplicate, otherwise it is the next sequential message from that node. For systems with restricted ranges of node ID's, the node ID storage vectors may be reduced to bit vectors. Lookup of the single bit is done through offset rather than searching a table.
We can provide further speed enhancements through hardware mechanisms. This is a lower level solution than adding mechanism to the software transport layer, and while it provides higher performance, the policies required to flush an ID pair and the need to deal with overflow cases can make the design quite complex.
The principle performance enhancement in hardware comes from adding an associative memory which allows for single cycle lookups of stored <node:ID> pairs. The size of this hardware table is dependent on the expected working set of communicating nodes. Overflow cases should be trapped to software, which can be optimized to reduce the number of stored nodes by releasing those least recently used.
If we make these mechanisms orthogonal to normal message transmissions, we can make our own assurances about the idempotence of our messages and only pay the cost of the general mechanism when used. This is desirable when optimizing for high performance, but requires very complex reasoning about the problem. It is unlikely that in most cases the custom solution is more efficient than the general solution, especially when hardware mechanisms are used.
We can reduce overhead in systems for which sequential delivery is costly by establishing a partial ordering on those message which are non-idempotent, allowing idempotent messages to proceed as available.
The complexity involved in dealing with the transformation of network messages is daunting, such methods are crucial for enforcing correctness in most fault-tolerant networks. For certain classes of networks, we provide mechanisms which avoid the design problems associated with transforming complex operations. Such mechanisms incur a very low cost. These costs can be reduced further if we are allowed assumptions about network traffic, machine size and network guarantees.