|
||||||||||||||||
An NoC Traffic Compiler for efficient FPGA implementation of Parallel Graph ApplicationsNachiket Kapre and André DeHonProceedings of the Reconfigurable Communication-Centric Systems on Chip, (RECOSOC, May 17--30, 2010) Parallel graph algorithms expressed in a Bulk-Synchronous Parallel (BSP) compute model generate highly-structured communication workloads from messages propagating along graph edges. We can expose this structure to traffic compilers and optimization tools before runtime to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable FPGA NoCs. In this paper, we perform load balancing, placement, fanout routing and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements. This allows us to demonstrate speedups between 1.2x and 22x (3.5x mean), area reductions (number of Processing Elements) between 3x and 15x (9x mean) and dynamic energy savings between 2x and 3.5x (2.7x mean) over a range of real-world graph applications. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.
|