Transit Note #13

Memory Transactions

Andre DeHon

Original Issue: May 1990

Last Updated: Tue Nov 9 12:40:34 EST 1993

Low-Level Transaction Model

In most current computers, there are exactly two memory transactions: the processor can either read data from a location in memory or write data to a location. In large parallel computers, or in any machine in which the response time for a read or write operation can be significantly longer than the processor's instruction cycle, we would like to consider splitting read and write operations into separate phases. A request phase would initiate the operation; a complete phase would actually perform the desired exchange of data.

These split-phase operations allow a smart compiler to warn the memory system about potentially high-latency memory operations before their completion is critical to the processor. Ideally, this segregation will allow the latency of long memory operations (e.g., cache misses or coherent writes) to be overlapped with useful computation. The warning to memory can occur as soon as the address to be read or written is known. If this warning can be scheduled sufficiently in advance of the completion operation, the completing read or write will require no more time than a read or write to the processor's local cache.
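
As a minimal sketch of the idea in C (read_initiate and read_data are hypothetical intrinsics standing in for the operations defined under Basic Memory Operations below):

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;

    /* Hypothetical intrinsics; see Basic Memory Operations below. */
    extern void   read_initiate(void *address, size_t words);
    extern word_t read_data(void *address);

    word_t fetch_with_overlap(word_t *p)
    {
        read_initiate(p, 1);   /* request phase: warn memory early    */
        /* ... unrelated useful work overlaps the memory latency ...  */
        return read_data(p);   /* complete phase: ideally cache-speed */
    }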

Block Optimizations

Often the processor will reference sequential words in memory in rapid succession. If the processor had to initiate each read or write with a separate instruction, the instruction count could grow quite large. Allowing the processor to specify the number of words to initiate at a given starting address will keep split-phase operations from creating a large increase in code size. Specifying block read and write initiates in this manner will also allow the memory system and network to take advantage of spatial locality, decreasing the latency of obtaining read and write access to blocks of data.

For example, these block initiates will allow the use of burst-mode and page-mode read and write operations. They will also allow a processing node to perform network operations on an entire block of data, thus minimizing network traffic.

Deciding how much of a block of data to initiate at once remains an issue and is left for the compiler to resolve. Initiating the largest blocks possible will decrease instruction count and decrease scalar latency, assuming no interactions occur with other contexts or processors. However, initiating large blocks which are shared by active processes can aggravate thrashing between processes.
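
For illustration, a minimal C sketch of the instruction-count difference, using the same hypothetical read_initiate intrinsic; the block size n is the compiler's choice:

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;
    extern void read_initiate(void *address, size_t words);

    void warm(word_t *a, size_t n)
    {
        /* Per-word initiates: n initiate instructions executed. */
        for (size_t i = 0; i < n; i++)
            read_initiate(&a[i], 1);

        /* Block initiate: one instruction covers the same n words,
           and the memory system may burst-fetch the block. */
        read_initiate(a, n);
    }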

When Complete Operations Can't

Sometimes the initiation phase of a read or write will not have been scheduled sufficiently far in advance for the operation to complete within a single processor cycle. When additional cycles are needed for the memory operation to complete, there is a choice of what to do in the meantime. The processor can switch tasks, provided the cost of switching tasks is normally less than the expected delay of the memory system. Alternately, the processor can busy wait for the memory system to complete the requested operation. Which option is preferable probably depends on the circumstances of the actual read or write. Certainly, if there is sufficient parallelism and context switches are fast, switching contexts will probably allow higher processor utilization. However, in critical or inherently sequential code regions, it would be preferable to wait for completions and allow the executing process to continue execution.

Some reasonable selections between these options can probably be made at compile time. Ideally, though, the decision of whether to context switch or busy wait would be made at run time: the memory system would indicate how much longer it expects the memory operation to take, and the processor would weigh this expected latency against the other useful work available to it. While beneficial in scheduling useful work, a scheme this complicated is probably impractical to actually implement.
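
For concreteness, a minimal sketch of that run-time policy in C; expected_latency_cycles, context_switch_cost_cycles, and context_switch are hypothetical hooks assumed here, not operations this note defines:

    #include <stdint.h>

    typedef uintptr_t word_t;

    /* All of these are hypothetical hooks. */
    extern unsigned expected_latency_cycles(void *address);
    extern unsigned context_switch_cost_cycles(void);
    extern void     context_switch(void);
    extern word_t   read_data_wait(void *address);

    word_t complete_read(word_t *address)
    {
        /* Switch contexts only while the expected wait exceeds the
           cost of switching; otherwise the remaining wait is short. */
        while (expected_latency_cycles(address) > context_switch_cost_cycles())
            context_switch();
        return read_data_wait(address);  /* short wait left: busy wait */
    }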

Memory Allocation

A smart memory coordinator might take care of most or all of the memory management, freeing the processor almost entirely for computation. In such a case, memory allocation would be another basic operation.

Freshly allocated memory should generally be writable by the processor without further memory operations; i.e., immediately following an allocate, writes to the allocated memory should succeed without further need to initiate the write. The allocate implies write initiation. This, of course, cannot always be true: it is unreasonable to fill a large portion of one's cache with a single large structure. Perhaps the right thing is to make sure all small structures are write initiated, while large structures have only their first few words write initiated. It is not immediately obvious where to draw the line between small and large. One guess would be to write initiate a single cache line. Two-word structures should certainly be considered small, so that conses can be allocated and used without further need to initiate writes. Unfortunately, the limit between "small" and "large" structures would be an implementation peculiarity which the compiler would have to know.
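
A minimal sketch of the cache-line guess in C; allocate, write_initiate, and write_data are the hypothetical intrinsics used throughout these sketches, and CACHE_LINE_WORDS is an assumed implementation constant:

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;

    extern void *allocate(size_t bytes);
    extern void  write_initiate(void *address, size_t words);
    extern void  write_data(void *address, word_t data);

    #define CACHE_LINE_WORDS 4   /* assumed "small"/"large" threshold */

    word_t *make_zeroed(size_t words)
    {
        word_t *p = allocate(words * sizeof(word_t));
        /* The allocate implies write initiation for the first cache
           line; a "large" structure must initiate the rest itself. */
        if (words > CACHE_LINE_WORDS)
            write_initiate(p + CACHE_LINE_WORDS, words - CACHE_LINE_WORDS);
        for (size_t i = 0; i < words; i++)
            write_data(&p[i], 0);
        return p;
    }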

When Allocations Fail

What should be done when allocation fails? To answer this, we must first think about what it means for allocation to fail in a multiprocessor environment. It could mean that there is no memory chunk of that size anywhere. However, it is more likely to mean that the node simply does not know where to find a chunk of memory that large. The local memory coordinator will either have to garbage collect locally (which is not guaranteed to solve the problem) or obtain the memory remotely (which is also not guaranteed to solve the problem). Either of these operations may involve considerable latency. Again, whether it is appropriate to busy wait or context switch when allocation fails will depend on where the allocate occurs.

If there really is no memory available of the requested size, then the processor should be interrupted and informed of this fact. For malloc in C, it would be appropriate simply to return a null pointer. However, we probably do not want this behavior in general, since it would require that every pointer returned from an allocate be checked to ensure it is non-null.
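
For comparison, a minimal C illustration of the per-call-site checking that a null-returning allocate imposes:

    #include <stdlib.h>

    int *make_vector(size_t n)
    {
        int *v = malloc(n * sizeof(int));
        if (v == NULL)      /* this test is required at every
                               allocation site */
            return NULL;    /* propagate the failure to the caller */
        return v;
    }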

Deallocation

While explicit deallocation is generally unnecessary in a garbage-collected, tagged, symbolic processing environment, it might prove beneficial for efficient support of popular declarative languages. It might also be useful in a symbolic environment if temporaries were being allocated on the heap rather than on a stack. Exactly what deallocate would do is not obvious; it could return memory to the free list, if one were being maintained.

Basic Memory Operations

If we split reads and writes into separate phases and provide alternate completion operations for the behaviors described above, we have the following set of basic operations (a hypothetical C intrinsic interface is sketched after the list):

Write Initiate
indicates that the processor wishes to perform a write operation of the specified length to the specified address. The memory system should initiate a write. This may include obtaining write access to a memory location, depending on the coherence model in use.

Write Data
presents data to actually be written into memory. If this write does not complete immediately, the processor may context switch.

Write Data Wait
also presents data to actually be written into memory. If this write does not complete immediately, the processor will busy wait for completion.

Read Initiate
indicates that the processor wants the specified amount of data from the specified memory location. The memory system should attempt to obtain the specified data and place it in the local cache.

Read Data
informs the memory system that the processor wants the data at the specified location before it will continue computation. If this read does not complete immediately, the processor may context switch.

Read Data Wait
informs the memory system that the processor wants the data at the specified address. The processor will busy wait for a response.

Allocate
requests a block of memory of the specified size from the memory system. If the allocate cannot succeed immediately, the processor may context switch.

Allocate Wait
requests a block of memory of the specified size from the memory system. The processor will busy wait for the response.

Deallocate
returns a block of memory, specified by starting address and length, to the memory system for reuse.
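
A hypothetical C intrinsic interface covering this operation set; the names and signatures are assumptions adopted for the sketches in this note, not a defined API:

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;   /* one machine word */

    void   write_initiate(void *address, size_t words);  /* Write Initiate  */
    void   write_data(void *address, word_t data);       /* Write Data      */
    void   write_data_wait(void *address, word_t data);  /* Write Data Wait */
    void   read_initiate(void *address, size_t words);   /* Read Initiate   */
    word_t read_data(void *address);                     /* Read Data       */
    word_t read_data_wait(void *address);                /* Read Data Wait  */
    void  *allocate(size_t bytes);                       /* Allocate        */
    void  *allocate_wait(size_t bytes);                  /* Allocate Wait   */
    void   deallocate(void *address, size_t bytes);      /* Deallocate      */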

Examples

N.B. Memory operations shown here are specified in the form:

    OPERATION <address register>, <data register>

<address register> is the register that will supply or receive the address associated with the memory operation, and <data register> is the register that will receive or supply the data associated with the memory operation.

For initiate operations, the <data register> should hold the size of the block read or write request.

Cons

In this scheme, cons becomes a short sequence of these operations.
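
A minimal C sketch, assuming the hypothetical intrinsics declared earlier; a two-word cell is "small," so the allocate implies write initiation and the two writes need no separate initiates:

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;

    extern void *allocate(size_t bytes);
    extern void  write_data(void *address, word_t data);

    word_t *cons(word_t car, word_t cdr)
    {
        word_t *cell = allocate(2 * sizeof(word_t)); /* implies initiation */
        write_data(&cell[0], car);
        write_data(&cell[1], cdr);
        return cell;
    }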

Inner Product

Possible implementations of (inner-product A B) would be as follows.

Using non-blocked initiates:
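
(A minimal C sketch under the same assumptions; each element's read is initiated one iteration ahead of the read that completes it.)

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;

    extern void   read_initiate(void *address, size_t words);
    extern word_t read_data(void *address);

    word_t inner_product(word_t *a, word_t *b, size_t n)
    {
        word_t sum = 0;
        if (n == 0)
            return sum;
        read_initiate(&a[0], 1);            /* prime the first reads */
        read_initiate(&b[0], 1);
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n) {                /* warn memory one element ahead */
                read_initiate(&a[i + 1], 1);
                read_initiate(&b[i + 1], 1);
            }
            sum += read_data(&a[i]) * read_data(&b[i]);
        }
        return sum;
    }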

Using blocked initiates:
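
(A minimal C sketch under the same assumptions; a single initiate per vector covers all n words, letting the memory system burst-fetch each block.)

    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t word_t;

    extern void   read_initiate(void *address, size_t words);
    extern word_t read_data(void *address);

    word_t inner_product(word_t *a, word_t *b, size_t n)
    {
        read_initiate(a, n);    /* one block initiate per vector */
        read_initiate(b, n);
        word_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += read_data(&a[i]) * read_data(&b[i]);
        return sum;
    }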

Acknowledgments

The idea of splitting initiate and complete phases for reads and writes has been advocated by Tom Knight for some time. This is a more general form of the prefetch operation supported by DASH [HGL90].

The distinction of processor behavior on cache miss when running with multiple contexts (i.e., waiting versus context switching) was formulated clearly by the Alewife Project [Gro89].

Block optimizations were inspired by John Kubiatowicz's block transfer work for Alewife [Kub90].

References

DeH90
Andre DeHon. Early Processor Ideas. Transit Note 11, MIT Artificial Intelligence Laboratory, May 1990.

Gro89
Alewife Group. Instruction Set Summary of APRIL/SPARC Processor. Alewife Systems Memo 2, MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge MA 02139, Nov 1989.

HGL90
John Hennessy, Anoop Gupta, and Monica Lam. DASH Architecture. Personal communication, April 1990.

Kub90
John Kubiatowicz. Special Mechanisms for Multi-Model Support. Alewife Systems Memo 4, MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge MA 02139, March 1990.
