Memory Transactions
Andre DeHon
Original Issue: May 1990
Last Updated: Tue Nov 9 12:40:34 EST 1993
In most current computers, there are exactly two kinds of memory transaction: the processor can either read data from a location in memory or write data to a location. In large parallel computers, or in any machine in which the response time of a read or write operation can be significantly greater than the instruction time of the processor, we would like to consider splitting read and write operations into separate phases. A request phase would initiate the operation; a complete phase would actually perform the desired exchange of data.
These split-phase operations allow a smart compiler to warn memory about a potentially high-latency memory operation before the completion of that operation is critical to the processor. Ideally, this separation will allow the latency of long memory operations (e.g., cache misses or coherent writes) to be overlapped with useful computation. The warning to memory can occur as soon as the address to be read or written is known. If this warning can be scheduled sufficiently far in advance of the completion operation, the completing read or write will require no more time than a read or write to the processor's local cache.
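The request/complete split can be sketched as a pair of operations. The following toy model is illustrative only: the names initiate_read() and complete_read() are hypothetical, not from any real instruction set, and the "memory" here is just a local array standing in for a potentially remote memory system.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t memory[256];   /* toy backing store */

void poke(size_t addr, uint32_t v) { memory[addr] = v; }

/* Request phase: warn the memory system that this address will be
 * needed.  In hardware this would start the (possibly remote) fetch;
 * the toy model has nothing to do. */
void initiate_read(size_t addr) {
    (void)addr;
}

/* Complete phase: actually obtain the data.  In hardware this might
 * stall or context switch if the data has not yet arrived; the toy
 * model returns immediately. */
uint32_t complete_read(size_t addr) {
    return memory[addr];
}

uint32_t split_phase_sum(size_t a, size_t b) {
    initiate_read(a);   /* request both fetches up front ...        */
    initiate_read(b);   /* ... so their latencies can be overlapped */
    /* other useful computation could be scheduled here */
    return complete_read(a) + complete_read(b);   /* complete phase */
}
```

The point of the structure is that both initiates are issued before either complete, so a real memory system could service the two requests concurrently.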
Often the processor will reference sequential words in memory in rapid succession. If the processor had to initiate each read or write with a separate instruction, the instruction count could grow quite large. Allowing the processor to specify the number of words to initiate at a given starting address will keep split phase operations from creating a large increase in code size. Specifying block read and write initiates in this manner will also allow the memory system and network to take advantage of spatial locality to decrease the latency of obtaining read and write access to the blocks of data.
For example, these block initiates will allow the use of burst-mode and page-mode read and write operations. They will also allow a processing node to perform network operations on an entire block of data, thus minimizing network traffic.
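The instruction-count savings can be made concrete with a small model. Both operations below are hypothetical; the model simply counts issued initiate instructions to contrast per-word initiation with a single block initiate.

```c
#include <stddef.h>

static int instructions_issued = 0;

/* One initiate instruction per word (hypothetical scalar op). */
void initiate_read(size_t addr) {
    (void)addr;
    instructions_issued++;
}

/* One initiate instruction for a whole block (hypothetical block op);
 * the memory system can turn this into a burst- or page-mode access. */
void initiate_block_read(size_t addr, size_t nwords) {
    (void)addr; (void)nwords;
    instructions_issued++;
}

int count_per_word(size_t base, size_t nwords) {
    instructions_issued = 0;
    for (size_t i = 0; i < nwords; i++)
        initiate_read(base + i);          /* nwords instructions */
    return instructions_issued;
}

int count_block(size_t base, size_t nwords) {
    instructions_issued = 0;
    initiate_block_read(base, nwords);    /* one instruction */
    return instructions_issued;
}
```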
How much of a block of data to initiate at once is an open issue, left for the compiler to resolve. Initiating the largest blocks possible will decrease instruction count and decrease scalar latency, assuming no interactions occur with other contexts or processors. However, initiating large blocks which are shared by active processes can aggravate thrashing between processes.
Sometimes the initiation phase of a read or write will not have been scheduled sufficiently far in advance for the operation to complete within a single processor cycle. When additional cycles are needed for the memory operation to complete, there is a choice of what to do until it does. The processor can switch tasks, as long as the cost of switching tasks is normally less than the expected delay of the memory system. Alternately, the processor can busy wait for the memory system to complete the requested operation. Which option is preferable probably depends on the circumstances of the particular read or write. Certainly, if there is sufficient parallelism and context switches are fast, switching contexts will probably allow higher processor utilization. However, in critical or inherently sequential code regions, it would be preferable to wait for completions and allow the executing process to continue execution.
Some of these selections can probably be made reasonably at compile time. Ideally, though, the decision of whether to context switch or busy wait should be made at run time: the memory system would indicate how much longer it expects to take before the memory operation completes, and the processor would weigh this expected latency against an idea of how much other useful work it could do in the meantime. While beneficial in scheduling useful work, a scheme this complicated is probably impractical to actually implement.
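The run-time decision rule described above can be sketched as a small policy function. Everything here is an assumption for illustration: the latency estimate from the memory system, the switch-cost constant, and the factor of two (charging for switching away and eventually switching back) are all stand-ins, not measurements.

```c
typedef enum { BUSY_WAIT, CONTEXT_SWITCH } policy_t;

/* Decide what to do while a memory operation is outstanding.
 * expected_latency_cycles: memory system's estimate of remaining wait.
 * switch_cost_cycles:      cost of one context switch.
 * runnable_contexts:       other useful work available to run. */
policy_t choose_policy(unsigned expected_latency_cycles,
                       unsigned switch_cost_cycles,
                       int runnable_contexts) {
    if (runnable_contexts == 0)
        return BUSY_WAIT;             /* no other useful work to do   */
    if (2 * switch_cost_cycles < expected_latency_cycles)
        return CONTEXT_SWITCH;        /* latency dwarfs switch cost   */
    return BUSY_WAIT;                 /* wait out a short miss        */
}
```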
A smart memory coordinator might take care of most or all of the memory management, freeing the processor almost entirely for computation. In such a case memory allocation would be another basic operation.
Freshly allocated memory should generally be writable by the processor without further memory operations; i.e., immediately following an allocate, writes to the allocated memory should succeed without further need to initiate the write. The allocate implies write initiation. This, of course, cannot always be true: it is unreasonable to fill a large portion of one's cache with a large structure. Perhaps the right thing is to make sure all small structures are write initiated, while large structures have only their first few words write initiated. It is not immediately obvious where to draw the line between small and large. One guess would be to write initiate a single cache line. Two-word structures should certainly be considered small, so that conses can be allocated and used without further need to initiate writes. Unfortunately, the limit between ``small'' and ``large'' structures would be an implementation peculiarity which the compiler would have to know.
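The allocate-implies-write-initiation policy, with the single-cache-line guess for the small/large boundary, might look like the following sketch. The line size, the initiate_write() operation, and the word-counting model are all hypothetical; the counter exists only so the policy's behavior is observable.

```c
#include <stddef.h>
#include <stdlib.h>

#define LINE_WORDS 8   /* assumed cache-line size in words: exactly the
                        * implementation peculiarity the compiler would
                        * have to know */

static size_t words_initiated;   /* model: total words write-initiated */

/* Hardware would begin acquiring write access to these words;
 * the model just counts them. */
void initiate_write(void *base, size_t nwords) {
    (void)base;
    words_initiated += nwords;
}

/* Allocate implies write initiation, capped at one cache line:
 * small structures are fully initiated, large ones get only
 * their first line. */
void *alloc_initiated(size_t nwords) {
    void *p = malloc(nwords * sizeof(long));
    if (p != NULL)
        initiate_write(p, nwords < LINE_WORDS ? nwords : LINE_WORDS);
    return p;
}
```

A two-word cons is fully write-initiated under this policy, while a 64-word structure has only its first eight words initiated.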
What should be done when allocation fails? To answer this, we first must think about what it means for allocation to fail in a multiprocessor environment. It could mean that there is no memory chunk of that size anywhere. However, it is more likely to mean that the node simply does not know where to find a chunk of memory that large. The local memory coordinator will either have to garbage collect locally (which is not guaranteed to solve the problem) or get the memory remotely (which, also, is not guaranteed to solve the problem). Either of these operations may involve considerable latency. Again, whether it is appropriate to busy wait or context switch when allocation fails will depend on where the allocate occurs.
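The fallback path of such a local memory coordinator might be structured as below. gc_local() and request_remote() are hypothetical stand-ins for operations that may each involve considerable latency and are not guaranteed to help; the model knobs let the sketch be exercised.

```c
#include <stddef.h>
#include <stdlib.h>

static int gc_will_help     = 0;   /* model knobs for the sketch */
static int remote_will_help = 0;

int gc_local(void)       { return gc_will_help; }      /* may not help */
int request_remote(void) { return remote_will_help; }  /* may not help */

void *node_alloc(size_t bytes) {
    void *p = malloc(bytes);
    if (p != NULL) return p;                       /* common fast path */
    if (gc_local() && (p = malloc(bytes)) != NULL)
        return p;                                  /* collected locally */
    if (request_remote() && (p = malloc(bytes)) != NULL)
        return p;                                  /* found remotely    */
    return NULL;   /* truly no memory: interrupt the processor */
}
```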
If there really is no memory available of the requested size, then the processor should be interrupted and informed of this fact. For malloc in C, it would be appropriate to simply return a null pointer. However, we probably don't want this to happen in general, since it would require that every pointer returned from an allocate be checked to ensure it was non-null.
While explicit deallocation is generally unnecessary in a garbage-collected, tagged, symbolic processing environment, it might prove beneficial for efficient support of popular declarative languages. It might also be useful in a symbolic environment if temporaries were being allocated on the heap rather than on a stack. Exactly what deallocate would do is not obvious. It could return memory to the free list, if one were being maintained.
If we split reads and writes into separate phases and provide alternate completion operations for the behaviors described above, we have the following set of basic operations:
N.B. Memory operations shown here are specified with two register operands: an address register, which supplies or receives the address associated with the memory operation, and a data register, which receives or supplies the data associated with the memory operation.
For initiate operations, the data register should instead hold the size of the block read or write request.
The idea of splitting initiate and complete phases for reads and writes has been advocated by Tom Knight for some time. This is a more general form of the prefetch operation supported by DASH [HGL90].
The distinction of processor behavior on cache miss when running with multiple contexts (i.e., waiting versus context switching) was formulated clearly by the Alewife Project [Gro89].
Block optimizations were inspired by John Kubiatowicz's block transfer work for Alewife [Kub90].