Transit Note #76
The Case of the Corrupted Acknowledgment

Andre DeHon

Original Issue: September 1992

Last Updated: Fri Nov 5 12:57:24 EST 1993

In their current manifestation RNP/ METRO style, source-responsible routing protocols are open to (at least) one fault-mode which is not handled well by the protocol itself and merits further considerations. This brief note describes the ``Corrupted Acknowledgment'' problem, some of its implications, and some thoughts on robust operation in its wake.

The Problem

The basic routing scheme(s) ((tn41) [EDP +92] [DeH92]) rely on checksummed messages to detect the corruption of data in the presence of faults which may occur dynamically and/or intermittently. A node receiving a data stream from the network will not commit to accepting the data until the node has verified the checksum. Similarly, reverse checksums and acknowledgments are used by the source to verify that the data arrived safely at the destination. Since the protocols of concern here are source-responsible, when checksums or acknowledgments fail, the destination may ignore the message. The source is responsible for retrying the connection.

By choosing checksums appropriately, we can make it highly unlikely that data corrupted by any fault in the interconnect will be accepted by the destination node. Whether or not the source notices the corruption itself, the acknowledgment from the destination informs the source of acceptance or rejection of the data by the destination node. We also choose the acknowledgment appropriately so that the chance that a destination node's indication of failure can be corrupted into appearing as a positive acknowledgment is sufficiently small. Hence, if the destination successfully returns a positive acknowledgment to the source, we have good reason to believe the message arrived uncorrupted.

Now, let us consider what happens when the source receives a corrupted checksum, corrupted data stream, or other indications that something is wrong. It is generally safe to assume something went wrong, but it may not be clear where the problem occurred. Particularly, if the fault only affected the return data from the destination node, the source can believe that an operation failed which the destination completed successfully. When the source thinks the operation has been corruptted, the source-responsible nature of the protocol has the source retry the operation since there is some (high) likelihood the operation has not been accepted by the destination. However, when this retry occurs in the case in which only the return data or acknowledgment was corrupted, the destination node will see the operation a second time.

Herein lies the potential problem. If performing an operation more than once changes the meaning of the message transmission ( i.e. is interpreted to mean something different than performing the operation exactly once), then the retry will result in errant behavior. An operation which can be performed multiple times and not behave any differently than if the operation were performed exactly once is termed idempotent. Conversely, an operation which cannot be performed multiple times without producing a semantically different result than when performed once is called non-idempotent. E.g. incrementing a counter is non-idempotent since the resulting counter value depends on the number of times the operation is performed. A simple write of a value to a memory location (with no other side-effects) is idempotent since it can be performed any number of times and the same value ends up stored in the memory location.

N.B. Neil Brock and William Weihl were responsible for first pointing out this potential problem to us.

Implications

As long as we are concerned about handling dynamic and transient faults, I believe the basic problem is unavoidable at the direct hardware level. Both the source and the destination must reach consensus on the successful completion of an operation. They rely on using the interconnection network to reach consensus and the communication network is susceptible to faults at any time.

At some level, then, it is necessary to guarantee that all operations performed over the network are idempotent. The question remains about at what level should the idempotence requirement be exposed. At one extreme, we could simply issue the caveat that all operations must be idempotent and leave the rest to the users of the network. At another, we could utilize intermediate protocols which turn non-idempotent operations into idempotent ones.

TCP Solution

TCP provides ``reliable'' data streams by using sequence numbers [Pos81]. When a source needs to communicate with a destination, the source arbitrates with the destination for a valid set of sequence numbers. The source annotates each unique packet of data transmitted to the destination with a different sequence number. The destination node keeps track of all the sequence numbers it has seen so that exactly one copy of each packet arriving at the destination is passed along to higher-level protocols. In this manner, any duplicates arising due to source-responsible retransmission are filtered out, making each message effectively idempotent at the protocol level above TCP.

While doable, this solution is not efficient for achieving high-speed communications in a large-scale multiprocessor context. The overhead in terms of the space and processing time required to track sequence numbers and filter messages based on sequence numbers could easily prove many times greater than the time and space required for basic message transmission.

Accepting the Limitation

We might ask, then, if it is possible to accept the limitation and explicitly design with basic communication primitives and transactions which exhibit idempotence. In this section, we consider the implications of this limitation.

Fetch-and-Op

Fetch-and-Op [GGK +84] [GS90] routines are often considered for obtaining synchronization and sequencing in a multiprocessor environment. In general a fetch-and-op routing proceeds as follows:

old_value = value
value = op(old_value[,args])
return old_value to caller

N.B. Operations (1) and (2) must be executed together, atomically, to achieve the behavior desired from fetch-and-op operations. Fetch-and-ops are only permissible in this model if the operation has the property that multiple applications of the operation and any arguments yield the save value. That is:

op(value,[args]) = op(op(value,[args]),[args])

As pointed out above, increment and decrement operations are not permissible. Operations which always set an individual bit (or bits) of a word are permissible.

Read/Writes

Raw read and write operations are allowable, but care must be taken if the read and write operations have side-effects due to the memory protocol.

If one has a memory system with full-empty bits [Smi78] [ALKK90], things become tricky. Write operations will be ok, as long as the system notes when the operation attempts to write the same data into a full-location and does not flag that as an error. Emptying-reads are more problematic. If the read is supposed to find the full-bit set and clear it, a retried read will find that the location is empty and possibly signal an error. Unless the memory system keeps track of the ``owner'' ( i.e. the node or thread emptying the location) of the emptied location and as a special-case allows the owner to reread the location, emptying-reads are non-idempotent.

Directory/owner based cache-coherence protocols [CFKA90] [ASHH88] have similar restrictions. Non-exclusive reads are probably not a problem; the cache-maintenance protocol just needs to deal with the fact that a single node/thread may read an operation multiple times, but need only be represented on its readers list once. Writes and exclusive reads will work fine as long as the protocol allows the ``owner'' of a memory location to re-read and re-write the location.

Remote Function Calls

A generic and powerful remote operation is that of remote function calls. These take many forms such as Alewife's Inter-Processor-Interrupts (IPI) [Kub90] or Active Messages (AM) [E +92]. Arbitrary functions are not idempotent and thus could not be used. Many functions can be protected by idempotent synchronization such that they will run exactly once per intended invocation. Any functions without side-effects is, of course, idempotent -- though most interesting remote functions will probably have side-effects.

An easy case to handle is unique or special purpose handlers which are known to run exactly once. Upon completion, such handlers can set a bit indicating that the handler cannot be run again (or, perhaps, until someone else clears the bit to reset the handler -- thought this must be considered carefully).

Example -- Put/Get

In one example of Active Messages, routines are defined for transferring large blocks of data in small message chunks ( i.e. PUT and GET) [E +92] [vEC92]. Each time a piece of the data is transferred a flag on the receiving node is incremented. The receiver can then detect the completion of the transfer when the flag reaches some pre-specified value. Since a counter is used for synchronization, these operations are non-idempotent.

If we replace the counter with an appropriately large bit-field, the operations can be idempotent. Upon completion of each transfer, the handler can set the appropriate bit. When all such bits are set, the receiver knows that the data has been transferred. If data must be retransmitted, re-writing the same data and re-setting the appropriate completion bit poses no semantic problems. This scheme, of course, uses memory less efficiently than the counter scheme. It also requires that the sender have a uniquely specified bit associated with each transfer. It does allow the arrival of individual chunks of data to be re-ordered arbitrarily just as the original GET/ PUT operations.

Summary

The corrupted acknowledgment is a real concern which we must be cognizant of at some level when designing applications and protocols for use on-top of our source-responsible message transmission protocol. Hiding the problem from the applications, in the manner done for distributed computer networks, is not necessarily efficient for basic communications on high-performance multiprocessors. The alternative is to design operations which are idempotent. This leaves one with a more restrictive set of possible operations, but may be acceptable.

At this point, I would like to get feedback.

Are these restrictions acceptable?
What sort of things might we want to do which cannot be easily and inexpensively recast into idempotent operations?
Are there other potential problems lurking here to watch for?
What other examples/classes of inexpensive-idempotent operations exist?

Transit Note #76
The Case of the Corrupted Acknowledgment

The Problem

Implications

TCP Solution

Accepting the Limitation

Fetch-and-Op

Read/Writes

Remote Function Calls

Example -- Put/Get

Summary

See Also...

References

Transit Note #76 The Case of the Corrupted Acknowledgment

The Problem

Implications

TCP Solution

Accepting the Limitation

Fetch-and-Op

Read/Writes

Remote Function Calls

Example -- Put/Get

Summary

See Also...

References

Transit Note #76
The Case of the Corrupted Acknowledgment