Virtual Memory and Memory Viewpoints for Distributed Systems
Andre DeHon
Original Issue: April 1993
Last Updated: Fri Nov 5 13:15:19 EST 1993

This note takes an informal look at various schemes for supporting globally-addressed data in distributed-memory systems. This note is motivated by discussions between the MIT Transit Project and the STING group at NEC Research Institute.
One question which arose during our discussions is: ``What sort of memory support is required underneath STING.'' As we understand it, STING currently uses a Virtual Memory (VM) system to provide a global-address space which is valid across all processors. We need to make sure we understand what services of the VM system are really necessary for STING to run properly. The VM system may be providing the following:
We agree that there needs to be a globally-named address space which allows access to the shared data on any node in the machine. The question then arises as to whether the non-local memory references are implicit or explicit. In the implicit case, the local/remoteness is transparent to the processor/compiler, while in the explicit case, the local/remoteness is exposed to the processor/compiler.
At one extreme, the memory system can export a pure, global, shared-memory view. In such a scenario, all memory references look identical to the processor and compiler. When a remote references is issued for data which is not local or cached locally, the memory system takes care of acquiring the data ``behind the processor's back''. This scenario allows the compiler and processor to be oblivious to data location. Since the remoteness of memory operations is transparent to the processor, the compiler and processor have no a priori knowledge of how long each memory operation will require. With a reasonable cache, good locality of reference, and little sharing, this scheme performs well. In this scheme execution is delayed for the full latency of each remote operations during cache misses. Even with a good caching scheme, the latency of single-thread execution will be increased by compulsory and sharing-based cache misses.
At the other extreme, the memory system can require explicit references of global data separate from local-memory references. In this memory view, as embodied in TAM [CSS +91] [TCS92] or Split-C [CDG +93], standard memory references are issued and serviced from local memory. Global references are issued explicitly and in a split-phase manner. The arrival of the reply data enables the computations which require the globally referenced data. Here, the remoteness of data is exposed to the processor and compiler. In particular, the compiler distinguishes memory references which may require moderately large latency from those it knows will complete quickly. The split-phase operations allow the compiler to explicitly schedule the data reference and usage separately. The compiler may then schedule useful work from the same execution stream during the access latency.
As articulated in [CSS +91], this model foregoes the use of a global-memory data cache. The results of global-memory fetches are placed in local memory where they are then operated upon. This, in effect, passes the job of locally caching global references to the compiler. The program can only take advantage of locality of reference for global data to the extent that the compiler can statically predict the data flow in the program.
The implicit view provides simple, clean memory semantics to the compiler and processor. However, hiding the latency of memory references does not allow the compiler to schedule execution as efficiently as possible. Compilation efficiency suffers because the compiler does not know anything about the latency of a memory operation and because the compiler cannot schedule the reference and consumption of data separately. Caching in an implicit setting mitigates the latency of many memory operations by exploiting temporal and physical locality of reference.
Explicit systems with no global-memory cache, empower the compiler to schedule execution of instructions within a thread as efficiently as possible by exploiting the knowledge that some memory operations may require large latency and allowing split-phase memory references. Without a cache, the only locality these systems can exploit is that which can be detected statically and scheduled by the compiler.
If we add an explicit prefetch operation to shared-memory hardware
support ( e.g. [LLG +91]), we can exploit most of the
benefits of the split-phase memory operations.
Further, if the compiler is designed to distinguish local and remote memory
references as much as possible, it can gain almost all of the benefits of
the explicitly referenced memory system. The only drawback relative to the
pure, implicit, shared-memory case is that some prefetch operations may be
unnecessary since the data may already be in the local cache. The prefetch
operation will, in all likelihood, consume memory system bandwidth and,
perhaps, execution cycles. Consequently, unnecessary prefetch operations
may be detrimental to execution efficiency.
Context switching is an idea in the shared-memory community for mitigating the effects of long-latency, remote operations in an implicit, global-reference setting. The processor may switch to another context and process instructions from another thread during the latency of the remote operation ( e.g. [ALKK90] [ACD +91]). This has the advantage of doing some useful work during the remote operation latency. Depending on the cost of a context switch and the latency of the remote operation, this might allow more efficient processor utilization during the operation latency. However, context switching does not improve the latency of single-thread operation. Rather, it will tend to lengthen the single-thread execution latency. The context switch overhead, if any, will add to thread run-time. If the context does not immediately switch back to the stalled thread when the data arrives, the single-thread latency is also increased by the amount of time between the arrival of data and the time when execution is allowed to resume on the context requesting the data.
While fast context switches will be worthwhile to utilize processing resources efficiently when excess parallelism exists, even single-cycle context switches will not improve single-thread execution latency. In cases where the available parallelism is always much larger than the number of processors, longer single-thread execution time will not necessarily slow down application run-time. However, in cases where there are sections of code with limited parallelism, single-thread execution latency is a critical component determining overall application performance (see pp. 16-17 of [DeH93]).
On machines with no hardware, shared-memory support ( e.g. CM5 [Thi91]), the explicit, global-reference scheme is clearly the best way to obtain high performance. In all other cases, it is not obvious which provides higher performance.
We would like to see the extent to which the compiler can be used to schedule remote operations as efficiently as possible. If it is possible to dismiss the hardware-managed, global, shared-memory cache without a significant degradation in performance, that might be worthwhile. In the interest of gaining some insight into these issues, we are willing to support both models, at some level, to enable comparison.
It will be moderately easy to implement support for an explicit scheme efficiently on MBTA (tn38) in the near future. The implicit scheme requires little from the hardware beyond an efficient message interface, relying on the compiler for efficient data management. Between the custom network interface and a co-processor to direct message traffic, we believe the basic MBTA hardware and low-level software can provide the requisite interfaces with moderately low overhead.
Using a co-processor with a tag-cache, we can add cache-management to MBTA to support an implicit, shared-memory scheme (tn63). The co-processor can be programmed to handle cache management and remote references behind the back of the computational processor, as necessary. Such a scheme will not handle cache-misses as efficiently as a custom, hardware-managed cache, but can be implemented in the moderately near term. At present, we do not fully understand all the issues associated with implementing such a global, shared-memory cache. The programmable, co-processor-managed, cache scheme will allow considerable flexibility for experimentation.