Transit Note #76

The Case of the Corrupted Acknowledgment

Andre DeHon

Original Issue: September 1992

Last Updated: Fri Nov 5 12:57:24 EST 1993

In their current manifestation RNP/ METRO style, source-responsible routing protocols are open to (at least) one fault-mode which is not handled well by the protocol itself and merits further considerations. This brief note describes the ``Corrupted Acknowledgment'' problem, some of its implications, and some thoughts on robust operation in its wake.

The Problem

The basic routing scheme(s) ((tn41) [EDP +92] [DeH92]) rely on checksummed messages to detect the corruption of data in the presence of faults which may occur dynamically and/or intermittently. A node receiving a data stream from the network will not commit to accepting the data until the node has verified the checksum. Similarly, reverse checksums and acknowledgments are used by the source to verify that the data arrived safely at the destination. Since the protocols of concern here are source-responsible, when checksums or acknowledgments fail, the destination may ignore the message. The source is responsible for retrying the connection.

By choosing checksums appropriately, we can make it highly unlikely that data corrupted by any fault in the interconnect will be accepted by the destination node. Whether or not the source notices the corruption itself, the acknowledgment from the destination informs the source of acceptance or rejection of the data by the destination node. We also choose the acknowledgment appropriately so that the chance that a destination node's indication of failure can be corrupted into appearing as a positive acknowledgment is sufficiently small. Hence, if the destination successfully returns a positive acknowledgment to the source, we have good reason to believe the message arrived uncorrupted.

Now, let us consider what happens when the source receives a corrupted checksum, corrupted data stream, or other indications that something is wrong. It is generally safe to assume something went wrong, but it may not be clear where the problem occurred. Particularly, if the fault only affected the return data from the destination node, the source can believe that an operation failed which the destination completed successfully. When the source thinks the operation has been corruptted, the source-responsible nature of the protocol has the source retry the operation since there is some (high) likelihood the operation has not been accepted by the destination. However, when this retry occurs in the case in which only the return data or acknowledgment was corrupted, the destination node will see the operation a second time.

Herein lies the potential problem. If performing an operation more than once changes the meaning of the message transmission ( i.e. is interpreted to mean something different than performing the operation exactly once), then the retry will result in errant behavior. An operation which can be performed multiple times and not behave any differently than if the operation were performed exactly once is termed idempotent. Conversely, an operation which cannot be performed multiple times without producing a semantically different result than when performed once is called non-idempotent. E.g. incrementing a counter is non-idempotent since the resulting counter value depends on the number of times the operation is performed. A simple write of a value to a memory location (with no other side-effects) is idempotent since it can be performed any number of times and the same value ends up stored in the memory location.

N.B. Neil Brock and William Weihl were responsible for first pointing out this potential problem to us.

Implications

As long as we are concerned about handling dynamic and transient faults, I believe the basic problem is unavoidable at the direct hardware level. Both the source and the destination must reach consensus on the successful completion of an operation. They rely on using the interconnection network to reach consensus and the communication network is susceptible to faults at any time.

At some level, then, it is necessary to guarantee that all operations performed over the network are idempotent. The question remains about at what level should the idempotence requirement be exposed. At one extreme, we could simply issue the caveat that all operations must be idempotent and leave the rest to the users of the network. At another, we could utilize intermediate protocols which turn non-idempotent operations into idempotent ones.

TCP Solution

TCP provides ``reliable'' data streams by using sequence numbers [Pos81]. When a source needs to communicate with a destination, the source arbitrates with the destination for a valid set of sequence numbers. The source annotates each unique packet of data transmitted to the destination with a different sequence number. The destination node keeps track of all the sequence numbers it has seen so that exactly one copy of each packet arriving at the destination is passed along to higher-level protocols. In this manner, any duplicates arising due to source-responsible retransmission are filtered out, making each message effectively idempotent at the protocol level above TCP.

While doable, this solution is not efficient for achieving high-speed communications in a large-scale multiprocessor context. The overhead in terms of the space and processing time required to track sequence numbers and filter messages based on sequence numbers could easily prove many times greater than the time and space required for basic message transmission.

Accepting the Limitation

We might ask, then, if it is possible to accept the limitation and explicitly design with basic communication primitives and transactions which exhibit idempotence. In this section, we consider the implications of this limitation.

Fetch-and-Op

Fetch-and-Op [GGK +84] [GS90] routines are often considered for obtaining synchronization and sequencing in a multiprocessor environment. In general a fetch-and-op routing proceeds as follows:

  1. old_value = value
  2. value = op(old_value[,args])
  3. return old_value to caller
N.B. Operations (1) and (2) must be executed together, atomically, to achieve the behavior desired from fetch-and-op operations. Fetch-and-ops are only permissible in this model if the operation has the property that multiple applications of the operation and any arguments yield the save value. That is: As pointed out above, increment and decrement operations are not permissible. Operations which always set an individual bit (or bits) of a word are permissible.

Read/Writes

Raw read and write operations are allowable, but care must be taken if the read and write operations have side-effects due to the memory protocol.

If one has a memory system with full-empty bits [Smi78] [ALKK90], things become tricky. Write operations will be ok, as long as the system notes when the operation attempts to write the same data into a full-location and does not flag that as an error. Emptying-reads are more problematic. If the read is supposed to find the full-bit set and clear it, a retried read will find that the location is empty and possibly signal an error. Unless the memory system keeps track of the ``owner'' ( i.e. the node or thread emptying the location) of the emptied location and as a special-case allows the owner to reread the location, emptying-reads are non-idempotent.

Directory/owner based cache-coherence protocols [CFKA90] [ASHH88] have similar restrictions. Non-exclusive reads are probably not a problem; the cache-maintenance protocol just needs to deal with the fact that a single node/thread may read an operation multiple times, but need only be represented on its readers list once. Writes and exclusive reads will work fine as long as the protocol allows the ``owner'' of a memory location to re-read and re-write the location.

Remote Function Calls

A generic and powerful remote operation is that of remote function calls. These take many forms such as Alewife's Inter-Processor-Interrupts (IPI) [Kub90] or Active Messages (AM) [E +92]. Arbitrary functions are not idempotent and thus could not be used. Many functions can be protected by idempotent synchronization such that they will run exactly once per intended invocation. Any functions without side-effects is, of course, idempotent -- though most interesting remote functions will probably have side-effects.

An easy case to handle is unique or special purpose handlers which are known to run exactly once. Upon completion, such handlers can set a bit indicating that the handler cannot be run again (or, perhaps, until someone else clears the bit to reset the handler -- thought this must be considered carefully).

Example -- Put/Get

In one example of Active Messages, routines are defined for transferring large blocks of data in small message chunks ( i.e. PUT and GET) [E +92] [vEC92]. Each time a piece of the data is transferred a flag on the receiving node is incremented. The receiver can then detect the completion of the transfer when the flag reaches some pre-specified value. Since a counter is used for synchronization, these operations are non-idempotent.

If we replace the counter with an appropriately large bit-field, the operations can be idempotent. Upon completion of each transfer, the handler can set the appropriate bit. When all such bits are set, the receiver knows that the data has been transferred. If data must be retransmitted, re-writing the same data and re-setting the appropriate completion bit poses no semantic problems. This scheme, of course, uses memory less efficiently than the counter scheme. It also requires that the sender have a uniquely specified bit associated with each transfer. It does allow the arrival of individual chunks of data to be re-ordered arbitrarily just as the original GET/ PUT operations.

Summary

The corrupted acknowledgment is a real concern which we must be cognizant of at some level when designing applications and protocols for use on-top of our source-responsible message transmission protocol. Hiding the problem from the applications, in the manner done for distributed computer networks, is not necessarily efficient for basic communications on high-performance multiprocessors. The alternative is to design operations which are idempotent. This leaves one with a more restrictive set of possible operations, but may be acceptable.

At this point, I would like to get feedback.

See Also...

References

ALKK90
Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th International Symposium on Computer Architecture, pages 104-114. IEEE, May 1990.

ANP86
Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. In Procceedings of the Workshop on Graph Reduction (Springer-Verlag Lecture Notes in Computer Science 279), September 1986.

ASHH88
Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th International Symposium on Computer Architecture, New York, June 1988. IEEE.

CFKA90
David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache-Coherence in Large-Scale Multiprocessors. IEEE Computer, 23(6):41-58, June 1990.

DeH92
Andre DeHon. METRO LINK -- METRO Network Interface. Transit Note 75, MIT Artificial Intelligence Laboratory, September 1992. [tn75 HTML link] [tn75 FTP link].

DKM91
Andre DeHon, Thomas F. Knight Jr., and Henry Minsky. RNP: Fault Tolerant Routing Protocol. Transit Note 41, MIT Artificial Intelligence Laboratory, March 1991. [tn41 HTML link] [tn41 FTP link].

E +92
Thorsten von Eicken et al. Active Messages: a Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual Symposium on Computer Architecture, Queensland, Australia, May 1992.

EDP +92
Eran Egozy, Andre DeHon, Samuel Peretz, Henry Minsky, and Thomas F. Knight Jr. METRO Architecture. Transit Note 73, MIT Artificial Intelligence Laboratory, August 1992. [tn73 HTML link] [tn73 FTP link].

GGK +84
A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. Designing an MIMD Parallel Computer. IEEE Transactions on Computers, C-32(2):175-189, February 1984.

GS90
G. Graunke and S.Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, 23(6):60-70, June 1990.

Kub90
John Kubiatowicz. Special Mechanisms for Multi-Model Support. Alewife Systems Memo 4, MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge MA 02139, March 1990.

MDK91
Henry Minsky, Andre DeHon, and Thomas F. Knight Jr. RN1: Low-Latency, Dilated, Crossbar Router. In Hot Chips Symposium III, 1991.

Min91
Henry Q. Minsky. A Parallel Crossbar Routing Chip for a Shared Memory Multiprocessor. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, January 1991. [FTP link].

Pos81
(Ed.) Jon Postel. Transmission Control Protocol -- DARPA Internet Program Protocol Specification. RFC 793, USC/ISI, Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, California, 90291, September 1981.

Smi78
B. J. Smith. A Pipelined, Shared-Resource MIMD Computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6-8, 1978.

vEC92
Thorsten von Eicken and David E. Culler. Building Communication Paradigms with the CM-5 Active Message Layer (CMAM). Technical report, University of California, Berkeley Computer Science Division, Berkeley, CA, 94720, July 1992.

MIT Transit Project