Accurate Parallel Floating-Point Accumulation
Edin Kadric, Paul Gurniak, and André DeHon
IEEE Transactions on Computers, Volume 65, Number 11, pp. 3224–3238, November 2016.
Using parallel associative reduction, iterative refinement, and conservative early termination detection, we show how to use tree-reduce parallelism to compute correctly rounded floating-point sums in O(log(N)) depth. Our parallel solution shows how we can continue to exploit the scaling in transistor count to accelerate floating-point performance even when clock rates remain flat. Empirical evidence suggests our iterative algorithm requires only two tree-reduce passes to converge to the accurate sum in virtually all cases. Furthermore, we develop the hardware implementation of two residue-preserving IEEE-754 double-precision floating-point adders on a Virtex-6 FPGA that run at the same 250 MHz pipeline speed as a standard adder. One adder creates the residue by truncation, requires only 22% more area than the standard adder, and allows us to support directed-rounding modes and to lower the cost of round-to-nearest modes. The second adder creates the residue while directly producing a round-to-nearest sum at 48% more area than a standard adder.
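To make the idea of residue-preserving tree reduction concrete, the following is a minimal software sketch, not the paper's hardware design or its exact algorithm. It assumes Knuth's TwoSum as a stand-in for a residue-preserving round-to-nearest adder, and it replaces the paper's conservative early termination detection with a much simpler stopping rule (stop when no residues remain, when a pass changes nothing, or after a pass limit). The names `two_sum`, `tree_reduce_pass`, `accurate_sum`, and `max_passes` are hypothetical, and this simplified rule does not by itself guarantee a correctly rounded result for every input.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and a + b == s + e exactly
    (IEEE-754 round-to-nearest, no overflow). Stands in for a residue-preserving adder."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def tree_reduce_pass(values):
    """One pairwise (tree) reduction. Returns the rounded total and the list of
    nonzero residues, so that total + sum(residues) equals the exact input sum.
    The loop is sequential here; each level's additions are independent, which
    is what gives O(log N) depth when they run in parallel."""
    residues = []
    level = list(values)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            s, e = two_sum(level[i], level[i + 1])
            next_level.append(s)
            if e != 0.0:
                residues.append(e)
        if len(level) % 2:
            # Odd element out is carried up to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    return (level[0] if level else 0.0), residues


def accurate_sum(values, max_passes=64):
    """Iterative refinement: feed the total and the residues back into another
    tree-reduce pass. Simplified stopping rule (an assumption of this sketch):
    stop when no residues remain, when a pass changes nothing, or at max_passes."""
    total, residues = tree_reduce_pass(values)
    passes = 1
    while residues and passes < max_passes:
        new_total, new_residues = tree_reduce_pass([total] + residues)
        passes += 1
        if new_total == total and new_residues == residues:
            break
        total, residues = new_total, new_residues
    return total, passes


# The two 1.0 terms are lost as rounding error in the first pass and
# recovered from the residues in the second.
print(accurate_sum([1e16, 1.0, -1e16, 1.0]))  # (2.0, 2)
```

In this sketch, the invariant that matters is that every pass preserves the exact sum across the total and its residues. Refinement therefore only redistributes error and never discards it.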
© 2016 IEEE. Authors/employers may reproduce or authorize others to reproduce The Work, material extracted verbatim from the Work, or derivative works to
the extent permissible under United States law for works authored by U.S. Government employees, and for the author's personal use or for
company or organizational use, provided that the source and any IEEE copyright notice are indicated, the copies are not used in any way that
implies IEEE endorsement of a product or service of any employer, and the copies themselves are not offered for sale.