Implementation of Computation Group

Home

Accurate Parallel Floating-Point Accumulation

Edin Kadric, Paul Gurniak, and André DeHon
Proceedings of the IEEE Symposium on Computer Arithmetic (Arith21, April 7-10, 2013)

Using parallel associative reduction, iterative refinement, and conservative termination detection, we show how to use tree reduce parallelism to compute correctly rounded floating-point sums in O(log N) depth at arbitrary throughput. Our parallel solution shows how we can continue to exploit Moore's Law scaling in transistor count to accelerate floating-point performance even when clock rates remain flat. Empirical evidence suggests our iterative algorithm only requires two tree reduce passes to converge to the accurate sum in virtually all cases. Furthermore, we develop the hardware implementation of a 250 MHz pipelined, native, residue-preserving IEEE-754 double-precision, floating-point adder on a Virtex 6 FPGA that requires only 48% more area than a standard adder without residue. Finally, we show how this module can be used as the base of a streaming accurate floating-point accumulation unit that can be tuned to consume m summands every cycle.
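
As a rough software illustration of the ideas in the abstract (not the hardware design or the exact algorithm from the paper), the Python sketch below uses Knuth's TwoSum as a stand-in for a residue-preserving adder, performs a pairwise tree reduction that collects the residues, and then feeds the residues back through further passes. The names two_sum, tree_pass, and refine are hypothetical. The fixed pass count is an assumption that replaces the paper's conservative termination detection, so this sketch does not by itself guarantee a correctly rounded result.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: return (s, r) with s = fl(a + b) and a + b == s + r exactly.

    Assumes IEEE-754 round-to-nearest arithmetic and no overflow.
    """
    s = a + b
    bb = s - a
    r = (a - (s - bb)) + (b - bb)
    return s, r


def tree_pass(values):
    """One pairwise (tree) reduction pass.

    Returns the root sum and the list of nonzero residues produced by the
    additions, so that root + sum(residues) equals sum(values) exactly.
    """
    level = list(values)
    residues = []
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            s, r = two_sum(level[i], level[i + 1])
            nxt.append(s)
            if r != 0.0:
                residues.append(r)
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
    return (level[0] if level else 0.0), residues


def refine(values, passes=2):
    """Run up to `passes` tree passes, feeding residues back in each time.

    Illustrative only: the fixed pass count is an assumption of this sketch.
    The leftover residues are returned so the caller can see whether the
    total is exact (empty list) or still carries an error term.
    """
    total, residues = tree_pass(values)
    for _ in range(passes - 1):
        if not residues:
            break
        total, residues = tree_pass(residues + [total])
    return total, residues


values = [1e16, 1.0, -1e16, 1.0]

# Plain left-to-right accumulation loses both small terms' contribution in part.
naive = 0.0
for v in values:
    naive += v

first_total, first_residues = tree_pass(values)
total, leftover = refine(values)

print(naive)            # 1.0
print(total, leftover)  # 2.0 []
```

In this example the first pass returns a root sum of 0.0 with two residues of 1.0, and the second pass folds those residues back in to reach the exact sum 2.0 with no residue left over. Each pass is error-free in the sense that the root plus the residues always equals the exact sum of that pass's inputs, which can be checked with exact rational arithmetic via Fraction.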

© 2013 IEEE. Authors/employers may reproduce or authorize others to reproduce The Work, material extracted verbatim from the Work, or derivative works to the extent permissible under United States law for works authored by U.S. Government employees, and for the author's personal use or for company or organizational use, provided that the source and any IEEE copyright notice are indicated, the copies are not used in any way that implies IEEE endorsement of a product or service of any employer, and the copies themselves are not offered for sale. (IEEE Copyright)

Author's local PDF copy of paper - parallel_fpaccum_arith2013.pdf
Distribution of Bluespec SystemVerilog source code for designs in the paper

Room 315, 200 South 33rd Street, Electrical and Systems Engineering Department, University of Pennsylvania, Philadelphia, PA 19104.