Transit Note #82

Virtual Memory and Memory Viewpoints for Distributed Systems

Andre DeHon

Original Issue: April 1993

Last Updated: Fri Nov 5 13:15:19 EST 1993

Introduction

This note takes an informal look at various schemes for supporting globally-addressed data in distributed-memory systems. This note is motivated by discussions between the MIT Transit Project and the STING group at NEC Research Institute.

Virtual Memory

One question which arose during our discussions is: ``What sort of memory support is required underneath STING.'' As we understand it, STING currently uses a Virtual Memory (VM) system to provide a global-address space which is valid across all processors. We need to make sure we understand what services of the VM system are really necessary for STING to run properly. The VM system may be providing the following:

  1. single address space for all data
  2. referential transparency
  3. coherence
  4. access to larger address space than physical memory
  5. separation of process address spaces
  6. address remapping (data migration)
  7. garbage collection support
  8. protection
From our discussion, we assume (1) is the primary requirement. We expect the current implementation relies on the VM system for (2). Comments made during the discussion lead us to believe that the VM system is not really supplying (3). We are unclear on the extent to which (4) through (8) are required by STING.

Distributed-Memory Models

We agree that there needs to be a globally-named address space which allows access to the shared data on any node in the machine. The question then arises as to whether the non-local memory references are implicit or explicit. In the implicit case, the local/remoteness is transparent to the processor/compiler, while in the explicit case, the local/remoteness is exposed to the processor/compiler.

Implicit, Remote-Memory References

At one extreme, the memory system can export a pure, global, shared-memory view. In such a scenario, all memory references look identical to the processor and compiler. When a remote references is issued for data which is not local or cached locally, the memory system takes care of acquiring the data ``behind the processor's back''. This scenario allows the compiler and processor to be oblivious to data location. Since the remoteness of memory operations is transparent to the processor, the compiler and processor have no a priori knowledge of how long each memory operation will require. With a reasonable cache, good locality of reference, and little sharing, this scheme performs well. In this scheme execution is delayed for the full latency of each remote operations during cache misses. Even with a good caching scheme, the latency of single-thread execution will be increased by compulsory and sharing-based cache misses.

Explicit, Remote-Memory References

At the other extreme, the memory system can require explicit references of global data separate from local-memory references. In this memory view, as embodied in TAM [CSS +91] [TCS92] or Split-C [CDG +93], standard memory references are issued and serviced from local memory. Global references are issued explicitly and in a split-phase manner. The arrival of the reply data enables the computations which require the globally referenced data. Here, the remoteness of data is exposed to the processor and compiler. In particular, the compiler distinguishes memory references which may require moderately large latency from those it knows will complete quickly. The split-phase operations allow the compiler to explicitly schedule the data reference and usage separately. The compiler may then schedule useful work from the same execution stream during the access latency.

As articulated in [CSS +91], this model foregoes the use of a global-memory data cache. The results of global-memory fetches are placed in local memory where they are then operated upon. This, in effect, passes the job of locally caching global references to the compiler. The program can only take advantage of locality of reference for global data to the extent that the compiler can statically predict the data flow in the program.

Discussion

The implicit view provides simple, clean memory semantics to the compiler and processor. However, hiding the latency of memory references does not allow the compiler to schedule execution as efficiently as possible. Compilation efficiency suffers because the compiler does not know anything about the latency of a memory operation and because the compiler cannot schedule the reference and consumption of data separately. Caching in an implicit setting mitigates the latency of many memory operations by exploiting temporal and physical locality of reference.

Explicit systems with no global-memory cache, empower the compiler to schedule execution of instructions within a thread as efficiently as possible by exploiting the knowledge that some memory operations may require large latency and allowing split-phase memory references. Without a cache, the only locality these systems can exploit is that which can be detected statically and scheduled by the compiler.

Best of Both Worlds

If we add an explicit prefetch operation to shared-memory hardware support ( e.g. [LLG +91]), we can exploit most of the benefits of the split-phase memory operations. Further, if the compiler is designed to distinguish local and remote memory references as much as possible, it can gain almost all of the benefits of the explicitly referenced memory system. The only drawback relative to the pure, implicit, shared-memory case is that some prefetch operations may be unnecessary since the data may already be in the local cache. The prefetch operation will, in all likelihood, consume memory system bandwidth and, perhaps, execution cycles. Consequently, unnecessary prefetch operations may be detrimental to execution efficiency.

Open Issues

  1. Does (Can) the compiler know enough that it can manage the locality of reference such that an explicit scheme can perform as few remote references as a cached, implicit scheme?

  2. Can the compiler discover when it should prefetch data well enough that little performance is lost due to unnecessary prefetch requests?

Effects of Fast Context Switch

Context switching is an idea in the shared-memory community for mitigating the effects of long-latency, remote operations in an implicit, global-reference setting. The processor may switch to another context and process instructions from another thread during the latency of the remote operation ( e.g. [ALKK90] [ACD +91]). This has the advantage of doing some useful work during the remote operation latency. Depending on the cost of a context switch and the latency of the remote operation, this might allow more efficient processor utilization during the operation latency. However, context switching does not improve the latency of single-thread operation. Rather, it will tend to lengthen the single-thread execution latency. The context switch overhead, if any, will add to thread run-time. If the context does not immediately switch back to the stalled thread when the data arrives, the single-thread latency is also increased by the amount of time between the arrival of data and the time when execution is allowed to resume on the context requesting the data.

While fast context switches will be worthwhile to utilize processing resources efficiently when excess parallelism exists, even single-cycle context switches will not improve single-thread execution latency. In cases where the available parallelism is always much larger than the number of processors, longer single-thread execution time will not necessarily slow down application run-time. However, in cases where there are sections of code with limited parallelism, single-thread execution latency is a critical component determining overall application performance (see pp. 16-17 of [DeH93]).

Recommendations

On machines with no hardware, shared-memory support ( e.g. CM5 [Thi91]), the explicit, global-reference scheme is clearly the best way to obtain high performance. In all other cases, it is not obvious which provides higher performance.

We would like to see the extent to which the compiler can be used to schedule remote operations as efficiently as possible. If it is possible to dismiss the hardware-managed, global, shared-memory cache without a significant degradation in performance, that might be worthwhile. In the interest of gaining some insight into these issues, we are willing to support both models, at some level, to enable comparison.

MBTA

It will be moderately easy to implement support for an explicit scheme efficiently on MBTA (tn38) in the near future. The implicit scheme requires little from the hardware beyond an efficient message interface, relying on the compiler for efficient data management. Between the custom network interface and a co-processor to direct message traffic, we believe the basic MBTA hardware and low-level software can provide the requisite interfaces with moderately low overhead.

Using a co-processor with a tag-cache, we can add cache-management to MBTA to support an implicit, shared-memory scheme (tn63). The co-processor can be programmed to handle cache management and remote references behind the back of the computational processor, as necessary. Such a scheme will not handle cache-misses as efficiently as a custom, hardware-managed cache, but can be implemented in the moderately near term. At present, we do not fully understand all the issues associated with implementing such a global, shared-memory cache. The programmable, co-processor-managed, cache scheme will allow considerable flexibility for experimentation.

See Also...

References

ACD +91
Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. MIT/LCS/TM 454, MIT, 545 Technology Sq., Cambridge, MA 02139, November 1991.

ALKK90
Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th International Symposium on Computer Architecture, pages 104-114. IEEE, May 1990.

CDG +93
David Culler, Andrea Dusseau, Seth Goldstein, Avind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelic. Introduction to Split-C. Version 0.05, January 1993.

CSS +91
David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine. In Proceedings of the Fourth International Conference on the Architectural Support for Programming Languages and Operating Systems, April 1991.

DeH91
Andre DeHon. MBTA: Quick Overview. Transit Note 38, MIT Artificial Intelligence Laboratory, January 1991. [tn38 HTML link] [tn38 FTP link].

DeH93
Andre DeHon. Robust, High-Speed Network Design for Large-Scale Multiprocessing. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, February 1993. [FTP link].

DSCM92
Andre DeHon, Thomas Simon, Nick Carter, and Henry Minsky. Charles: Second Revision MBTA Node. Transit Note 63, MIT Artificial Intelligence Laboratory, January 1992. [tn63 HTML link] [tn63 FTP link].

LLG +91
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, and John Hennessy. Overview and Status of the Stanford DASH Multiprocessor. In Norihisa Suzuki, editor, Proceedings of the International Symposium on Shared Memory Multiprocessing, pages 102-108. Information Processing Society of Japan, April 1991.

TCS92
Kenneth Traub, David Culler, and Klaus Schauser. Global Analysis for Partitioning Non-Strict Programs into Sequential Threads. In ACM Conference on Lisp and Functional Programming, pages 324-334. ACM, June 1992.

Thi91
Thinking Machines Corporation, Cambridge, MA. CM5 Technical Summary, October 1991.

MIT Transit Project