Transit Note #99

Smart Compiler Advantages

Andre DeHon

Original Issue: November, 1993

Last Updated: Thu Dec 16 20:13:50 EST 1993

Purpose

This document summarizes, more succinctly than (tn87), the ways in which the Smart Compiler will be superior to conventional compiler technology. See (tn87) for the basic development of the Smart Compiler.

Advantages

The following principles and optimization classes are those the Smart Compiler will exploit to produce executables with lower execution latency than traditional compilers can achieve.

  1. Collapse abstraction artifacts from high-level code and system model

  2. Operator strength reduction

  3. Data-type and structure strength reduction

  4. RACE (Reduced Average Case Execution-time)

  5. Accurate operation timing estimates for instruction sequence selection tradeoffs and scheduling

The following technologies are the key enablers which will allow the Smart Compiler to perform the classes of optimization described above.

  1. Partial Evaluation / Specialization
  2. Run-Time Feedback

Collapsing

As suggested in (tn93), traditional compilers take the procedure abstraction hierarchy as given. They faithfully reproduce almost all procedure calls at run-time, incurring the procedure-call overhead and inhibiting a large class of optimizations by treating procedure calls as optimization barriers. The Smart Compiler will collapse these artificial barriers, allowing it to (1) eliminate much of the run-time procedure-call overhead and (2) optimize across the former procedure boundaries.
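As a concrete illustration (a minimal C sketch; the accessor and its uses are invented for this note), consider a trivial accessor procedure. A traditional compiler reproduces each call faithfully at run-time; collapsing turns each call into a direct load and lets the loads be optimized and scheduled as a unit:

    /* A trivial accessor; under a traditional compiler each use below
       is a genuine procedure call at run time. */
    static int vec_get(const int *v, int i) { return v[i]; }

    int sum3(const int *v)
    {
        /* After collapsing, each call becomes a direct load, and the
           three loads can be scheduled together across the old
           procedure boundaries. */
        return vec_get(v, 0) + vec_get(v, 1) + vec_get(v, 2);
    }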

The Smart Compiler can make aggressive use of specialization and partial evaluation to reduce run-time execution latency. Going beyond simply inlining, or open-coding, leaf procedures, the Smart Compiler can partially evaluate the program with respect to any data known at compile time. Further, run-time feedback will also allow the compiler to specialize segments of code around the values which are seen most often during dynamic execution but which cannot be determined through static program analysis.
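For instance (a minimal sketch; the routine and its specialization point are assumed for illustration), partially evaluating a general exponentiation routine with respect to a compile-time-known exponent eliminates the loop and counter entirely:

    /* General routine: the loop runs at execution time. */
    double power(double x, unsigned n)
    {
        double r = 1.0;
        while (n-- > 0)
            r *= x;
        return r;
    }

    /* Residual code after partially evaluating power() with respect
       to n == 3: no loop, no counter, just the multiplies. */
    double power3(double x) { return x * x * x; }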

Operator Strength Reduction

Traditional compilers exploit a limited form of operator strength reduction to optimize code inside loops. The goal of strength reduction is to replace a more expensive, often more general, operation with a less expensive one when the cheaper routine suffices. In a traditional compiler, multiplication by a loop index can often be replaced by addition of a constant on each loop pass. On processors where multiplication is much more expensive than addition, this can be a significant optimization.
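The classic transformation looks like the following C sketch (routine and variable names are invented for illustration):

    long dot_stride(const long *data, int n)
    {
        long sum = 0;
        /* Before: a multiply on every iteration. */
        for (int i = 0; i < n; i++)
            sum += data[i * 12];
        return sum;
    }

    long dot_stride_reduced(const long *data, int n)
    {
        long sum = 0;
        /* After: the per-iteration multiply is strength-reduced to a
           running add on the index j. */
        for (int i = 0, j = 0; i < n; i++, j += 12)
            sum += data[j];
        return sum;
    }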

The Smart Compiler can take this general principle to new extremes using specialization and feedback from run time. Collapsing and specialization allow the compiler to know more accurately how a given operation is being used. This can often disambiguate the properties required of an operator and allow a less expensive, possibly more specialized, operator to be used in its place. In C, we can think about changing multiplication to addition or left-shift operations when the use can be determined to be sufficiently restricted. In LISP, we can think about changing generic operators (e.g., a generic addition operation which can operate on any combination of numeric operands of a variety of types) into specialized operators (e.g., an integer addition operator which takes two operands, both of which must be integers, and produces an integer result) when the types are known or the range of types can be limited.
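The LISP case can be sketched in C with an invented tagged representation (the layout and names here are purely illustrative):

    typedef struct {
        enum { TAG_INT, TAG_FLT } tag;
        union { long i; double f; } v;
    } Num;

    /* Generic addition: dispatches on both operand types at run time. */
    Num generic_add(Num a, Num b)
    {
        Num r;
        if (a.tag == TAG_INT && b.tag == TAG_INT) {
            r.tag = TAG_INT;
            r.v.i = a.v.i + b.v.i;
        } else {
            double x = (a.tag == TAG_INT) ? (double)a.v.i : a.v.f;
            double y = (b.tag == TAG_INT) ? (double)b.v.i : b.v.f;
            r.tag = TAG_FLT;
            r.v.f = x + y;
        }
        return r;
    }

    /* Specialized addition, usable once both operands are known to be
       integers: one machine add, no dispatch, no tag manipulation. */
    long int_add(long a, long b) { return a + b; }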

When static analysis cannot limit the range of types or values to a sufficiently small set to enable significant strength reduction, feedback from run time often can. By observing the range of values which occur in practice, the compiler can produce strength-reduced versions of code which match dynamic operator usage. Since dynamic feedback cannot be guaranteed to uncover every value an operator may ever see, code sequences specialized on run-time information are produced alongside the more general code and used only when run-time predicates indicate the specialized sequence is safe.
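A sketch of this guarded-specialization pattern (the observed value and routine names are hypothetical):

    /* Full-strength routine, always available as a fallback. */
    static long mul_general(long a, long b) { return a * b; }

    /* Run-time feedback (hypothetically) showed that b is almost
       always 4, so a strength-reduced path is emitted alongside the
       general code, guarded by a run-time predicate. */
    long mul(long a, long b)
    {
        if (a >= 0 && b == 4)       /* predicate from observed values;
                                       a >= 0 keeps the shift safe */
            return a << 2;          /* specialized: shift, not multiply */
        return mul_general(a, b);   /* safe fallback for everything else */
    }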

Data-Type and Structure Strength Reduction

In a similar manner, when an abstract data structure is used in a restricted manner, it can often be profitably replaced by a data structure with less expensive access and maintenance operations. For instance, a LISP list which is created with uniform data elements and is not arbitrarily ``grown'' can be replaced with an array (sketched in C below). In many cases, this will make traversal and access much cheaper, since the array requires less indirection to access any given element. This class of transforms can achieve a significant reduction in the run-time overhead associated with more powerful data structures by moving the cost of data-structure traversal from run-time to compile time.

Again, when static information cannot sufficiently limit the use of a data structure to allow strength reduction, run-time feedback can indicate the typical use of a data structure during execution and hence point to strength reductions which are worthwhile in practice. As with operator strength reduction, since the dynamic information does not guarantee that the greater generality will never be required, the specialized code should include the ability to fall back to the full-strength data structure in case execution requires features not available in the strength-reduced version.
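The list-to-array example above might look as follows in C (the cons-cell layout is invented for illustration):

    #include <stddef.h>

    typedef struct Cons { long car; struct Cons *cdr; } Cons;

    /* Before: one pointer indirection per element traversed. */
    long sum_list(const Cons *l)
    {
        long s = 0;
        for (; l != NULL; l = l->cdr)
            s += l->car;
        return s;
    }

    /* After strength reduction: contiguous storage, no indirection,
       and the element count is known up front. */
    long sum_array(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }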

RACE

The Smart Compiler employs the Reduced Average Case Execution-time (RACE) principle to decrease execution latency. The key idea here is that dynamic run-time feedback allows us to determine a probability-distribution function (PDF) for the data seen by the program or routines in the program. We can use this PDF to rewrite the program to minimize expected running time with respect to the common input sequences.

One way to conceptualize how this transform works is to draw an analogy to information theory and data compression. Information theory says that if we want to distinguish among N items, all of which are equally likely, we must communicate log2(N) bits of data. In this case, we can encode each item as a bit string of length log2(N). Further, if we know that the data are not equally likely, we can, on average, get away with transmitting fewer bits by choosing an encoding which weights the bit-length of each data item by the logarithm of its inverse probability (bit-length(item_i) = log2(1/p_i)). More frequently transmitted items are encoded with fewer bits, resulting in less transmitted data. In fact, information theory goes on to establish that the encoding length achieved by selecting bit-lengths as above is the most compact encoding achievable that does not lose any information present in the transmitted data stream. This phenomenon is exactly the one commonly used to compress data to make the most effective use of limited transmission bandwidth and/or limited disk space. In some cases, one uses an a priori probability distribution -- assuming it is close enough to the actual probability distribution and avoiding the need to recompute the PDF, and hence the encoding scheme, on the fly for each dataset (e.g., Huffman encoding with a fixed, precomputed code table). For maximum efficiency, one customizes the encoding to the datastream.
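As a concrete illustration (numbers chosen for simplicity): with four symbols of probabilities 1/2, 1/4, 1/8, and 1/8, a uniform encoding costs 2 bits per symbol, while an encoding with bit-lengths 1, 2, 3, and 3 costs (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits per symbol on average.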

RACE does the same thing for execution latency. Abstractly, we can consider the ideal optimization of a program to be one which runs in minimum time on all inputs -- assuming all inputs are equally likely. This corresponds to the case of distinguishable items with uniform probability in the information-theoretic setting. Now, since all inputs to a program are generally not equally likely, we can consider analogous transforms which reduce the average-case execution latency by generating code which runs faster for the more likely inputs (execution sequences). As in the information-encoding case, this may require unlikely input sequences to run longer than they would have in the uniform-likelihood case. In summary, the Smart Compiler achieves run-time (execution latency) compression by compiling executables whose critical-path latency for a given set of inputs scales with the inverse probability of that data.
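For example (a minimal sketch; the routine and the profiled value are hypothetical), suppose run-time feedback shows that a stride argument is almost always 1:

    /* Common case (stride == 1) observed in run-time feedback: the
       contiguous loop is simpler and easier to pipeline.  Unusual
       strides still work, but pay an extra branch -- the RACE
       tradeoff of slowing rare inputs to speed common ones. */
    void scale(double *dst, const double *src, int n, int stride)
    {
        if (stride == 1) {
            for (int i = 0; i < n; i++)
                dst[i] = 2.0 * src[i];
        } else {
            for (int i = 0; i < n; i++)
                dst[i * stride] = 2.0 * src[i * stride];
        }
    }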

We can think of conventional trace-scheduling (e.g., [FR91], [Fis81]) as picking an a priori PDF on input data sets and generating a single executable encoded to perform well to the extent that the input data is close to that a priori PDF. We can view the dynamic, run-time feedback advocated for the Smart Compiler as an adaptive compression scheme which adapts the encoding to the data actually seen -- still assuming some homogeneity between past and future runs.

Note that conventional trace-scheduling is probably closest in analogy to textual compression based on character frequency: it mostly looks at small atoms (branches) in fairly local contexts. Just as we can often achieve higher data-compression rates by selecting higher-level, structured objects for compression, we can expect to achieve superior (lower) average-case run-time latency by extending program reorganization and specialization to higher levels based on run-time feedback.

One consequence of applying the RACE principle is LNUFE: Little or No Unused Functionality Expense. Many feature-rich programming languages and environments provide a large number of options and features. With conventional compiler technology, each of these features has an added run-time cost even when it is not used. In some cases, this cost shows up as potentially frequent run-time checks or additional indirection to resolve a function reference. Any given individual, however, seldom uses the full set of features. By doing aggressive RACE-style optimization, the unused, or seldom used, functionality is pushed out of the critical path for execution, while all of the functionality which is used is pulled in and aggressively optimized into the normal execution path.

For example, consider a windowing system. Often there are many options allowing the user to customize a window's behavior and to change that behavior during the object's lifetime. To provide all these features, the windowing code is parameterized in many ways and performs many checks for feature utilization. A given user or application, however, generally exercises only a subset of these options, and many of the parameters remain fixed; the user does not use all of the potential flexibility provided by the system. The Smart Compiler can use the RACE principle to specialize higher-performance window procedures for each user based on feedback about that user's usage patterns. Once optimized, the unused features cost the end user virtually nothing.
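A hedged sketch of what such specialization might look like (the window options and drawing routines are invented for this note):

    typedef struct { int border; int transparent; int scaled; } WinOpts;

    static void draw_border(void)      { /* drawing elided */ }
    static void blend_background(void) { /* drawing elided */ }
    static void rescale_contents(void) { /* drawing elided */ }
    static void draw_contents(void)    { /* drawing elided */ }

    /* General redraw: every feature is checked on every call. */
    void redraw(const WinOpts *o)
    {
        if (o->border)      draw_border();
        if (o->transparent) blend_background();
        if (o->scaled)      rescale_contents();
        draw_contents();
    }

    /* Specialized for a user whose observed profile fixes border = 1
       and all other options at 0: the checks and the unused features
       disappear from the execution path. */
    void redraw_for_user(void)
    {
        draw_border();
        draw_contents();
    }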

Accurate Timing

Conventional compilers optimize based on static predictions of operator timing. Such timings seldom reflect run-time costs accurately when operator execution time is data-dependent. Further, to minimize the amount of operator information included, operator timings are usually given for an abstract architectural family rather than for a specific machine. By extracting timing information from actual execution, the Smart Compiler will be better able to judge the real cost of each operation in context and hence make better decisions about instruction-sequence selection and scheduling.

Accurate timing also allows the compiler to deal robustly with scenarios which are too complicated to model explicitly for the compiler. Cache effects and multiprocessor network congestion are two likely examples.

Aggressive inlining can often produce larger code sequences. A ``more optimal'' code sequence which exceeds the size of the instruction cache may run more slowly than a ``less optimal'' sequence which fits in the cache. Without explicitly modeling the cache size, the Smart Compiler can determine from run-time feedback when a particular transform has a performance-degrading side effect.

The cost of each memory operation a program performs depends on whether the reference causes a virtual-memory page fault or a cache miss. Traditional compilers treat all memory references as equally costly. With run-time feedback on operator costs, the Smart Compiler can schedule more efficiently, knowing which memory operations are likely to be slow. Further, accurate reference timing allows the compiler to reason about the benefits of prefetching a particular data value. When run-time feedback indicates that a value is likely to be cached, it may be best not to insert a prefetch for that memory location. When feedback indicates that access to a particular item is slow, however, the compiler can experiment with scheduling an explicit prefetch for the data earlier in the execution stream. Accurate timing from run-time feedback allows the Smart Compiler to evaluate the effects of its experiments and select the most beneficial code sequences.
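For instance (a sketch assuming a GCC-style __builtin_prefetch intrinsic; the access pattern and prefetch distance are hypothetical), the compiler could experiment with issuing prefetches ahead of indirect accesses that feedback showed to be slow:

    double sum_indirect(const double *table, const int *idx, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            /* Fetch a future element early; feedback on the resulting
               timings tells the compiler whether the prefetch (and the
               distance of 8) actually helped. */
            if (i + 8 < n)
                __builtin_prefetch(&table[idx[i + 8]]);
            sum += table[idx[i]];
        }
        return sum;
    }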

Without modeling the exact communication pattern occurring in a program, it is difficult, if not impossible, to predict how long a particular network operation will take to complete. In almost all modern networks, congestion can have a large effect on network operation latency. Feedback on the time required by each network operation in its context will allow the compiler to schedule network operations more efficiently. Further, when there are computation-versus-communication tradeoffs, accurate operation timing information allows the compiler to trade off communication against computation effectively.

Accurate accounts of memory access times can also be used to optimize data placement in distributed-memory multiprocessors. The Smart Compiler can collect statistics on dynamic memory utilization by processor location and reorganize the placement of data to reduce access latency and hence overall execution time.

References

DEK93
Andre DeHon, Ian Eslick, and Thomas F. Knight Jr. Toward 21st Century Computing: Transit Perspective. Transit Note 93, MIT Artificial Intelligence Laboratory, September 1993.

DEMK93
Andre DeHon, Ian Eslick, John Mallery, and Thomas F. Knight Jr. Prospects for a Smart Compiler. Transit Note 87, MIT Artificial Intelligence Laboratory, June 1993.

Fis81
Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, 30(7):478-490, July 1981.

FR91
Joseph A. Fisher and B. Ramakrishna Rau. Instruction-Level Parallel Processing. Science, 253:1233-1241, September 1991.

MIT Transit Project