Accurate Parallel Floating-Point Accumulation
Edin Kadric, Paul Gurniak, and André DeHon
Proceedings of the IEEE Symposium on Computer Arithmetic (Arith21, April 7–10, 2013)
Using parallel associative reduction, iterative refinement, and conservative termination detection, we show how to use tree-reduce parallelism to compute correctly rounded floating-point sums in O(log N) depth at arbitrary throughput. Our parallel solution shows how we can continue to exploit Moore's Law scaling in transistor count to accelerate floating-point performance even when clock rates remain flat. Empirical evidence suggests that our iterative algorithm requires only two tree-reduce passes to converge to the accurate sum in virtually all cases. Furthermore, we develop a hardware implementation of a 250 MHz pipelined, native, residue-preserving IEEE-754 double-precision floating-point adder on a Virtex 6 FPGA that requires only 48% more area than a standard adder without residue. Finally, we show how this module can serve as the basis of a streaming accurate floating-point accumulation unit that can be tuned to consume m summands every cycle.
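The idea of a residue-preserving tree reduction with iterative refinement can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware design or their algorithm as published: it uses Knuth's TwoSum as a software stand-in for a residue-preserving adder, and it replaces the paper's conservative termination detection with a fixed pass count (the hypothetical `max_passes` parameter, defaulting to two). The function names are assumptions chosen for illustration, and the sketch assumes no intermediate overflow.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, r) where s is the rounded sum a + b
    and r is the rounding error, so that s + r equals a + b exactly
    (assuming no overflow)."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    residue = (a - a_virtual) + (b - b_virtual)
    return s, residue


def tree_reduce_pass(values):
    """One pairwise (tree) reduction pass.

    Returns the rounded total and the list of nonzero residues produced
    by the individual additions. The total plus all residues equals the
    exact sum of the inputs.
    """
    level = list(values)
    residues = []
    while len(level) > 1:
        next_level = []
        # Add neighbouring pairs; each addition keeps its residue.
        for i in range(0, len(level) - 1, 2):
            s, r = two_sum(level[i], level[i + 1])
            next_level.append(s)
            if r != 0.0:
                residues.append(r)
        # An unpaired last element is carried up unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    total = level[0] if level else 0.0
    return total, residues


def accumulate(values, max_passes=2):
    """Sum `values` with up to `max_passes` tree-reduce passes.

    ASSUMPTION: a fixed pass count stands in for real termination
    detection, so this sketch does not guarantee correct rounding.
    """
    total, residues = tree_reduce_pass(values)
    for _ in range(max_passes - 1):
        if not residues:
            break  # every addition was exact; nothing left to refine
        # Refinement: fold the residues back in with another pass.
        total, residues = tree_reduce_pass([total] + residues)
    return total
```

For example, `accumulate([1e16, 1.0, -1e16, 1.0])` returns `2.0`, whereas a plain left-to-right double-precision loop over the same inputs yields `1.0` because each `1.0` is lost when added to a value of magnitude `1e16`. The first pass recovers both lost values as residues, and the second pass folds them back into the total.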
© 2013 IEEE. Authors/employers may reproduce or authorize others to reproduce The Work, material extracted verbatim from the Work, or derivative works to
the extent permissible under United States law for works authored by U.S. Government employees, and for the author's personal use or for
company or organizational use, provided that the source and any IEEE copyright notice are indicated, the copies are not used in any way that
implies IEEE endorsement of a product or service of any employer, and the copies themselves are not offered for sale.