Penn Logo
Vertical Line

Implementation of Computation Group

Divider

An NoC Traffic Compiler for efficient FPGA implementation of Parallel Graph Applications

Nachiket Kapre and André DeHon
Proceedings of the Reconfigurable Communication-Centric Systems on Chip, (RECOSOC, May 17--30, 2010)



Parallel graph algorithms expressed in a Bulk-Synchronous Parallel (BSP) compute model generate highly-structured communication workloads from messages propagating along graph edges. We can expose this structure to traffic compilers and optimization tools before runtime to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable FPGA NoCs. In this paper, we perform load balancing, placement, fanout routing and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements. This allows us to demonstrate speedups between 1.2x and 22x (3.5x mean), area reductions (number of Processing Elements) between 3x and 15x (9x mean) and dynamic energy savings between 2x and 3.5x (2.7x mean) over a range of real-world graph applications. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.

Divider
Room# 315, 200 South 33rd Street, Electrical and Systems Engineering Department, Philadelphia , University of Pennsylvania, PA 19104.