
ATHEENA Gets Buff(er)ed: Modelling and Constructing Early-Exit Network FPGA Accelerators


If you have a question about this talk, please contact George A Constantinides.

The continued need for improvements in the accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose shifting the focus to the varying difficulty of individual data samples in order to further improve efficiency and reduce the average compute per classification. Input-dependent computation allows the network to make runtime decisions and finish a task early once the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software.
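As a minimal sketch of the runtime decision described above, the following shows one common early-exit policy: after each network stage, an exit head produces class logits, and inference stops as soon as the top softmax probability clears a confidence threshold. The stage/head structure and the threshold rule here are illustrative assumptions, not ATHEENA's actual exit policy.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a vector of class logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def early_exit_forward(x, stages, exit_heads, threshold=0.9):
    """Run stages in order; after each stage, its exit head produces
    class logits. If the top softmax probability meets the threshold,
    stop early and return (prediction, exit_index)."""
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        x = stage(x)
        probs = softmax(head(x))
        if probs.max() >= threshold:
            return int(probs.argmax()), i
    # No exit was confident enough: fall through to the last result.
    return int(probs.argmax()), len(stages) - 1
```

Raising the threshold pushes more samples to later (more expensive) exits, trading average compute for accuracy; the exit-probability distribution this induces is exactly what a toolflow like ATHEENA can exploit when allocating resources per stage.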

We create A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. We employ probabilistic modelling methods derived from queueing theory to accurately determine throughput under the finite buffer constraints dictated by limited on-chip Block-RAM resources. These analytical methods have been verified against custom event-driven simulations of an abstract accelerator model: they are substantially faster to run while remaining within 2% of the simulated throughput. The toolflow uses the dataflow model of fpgaConvNet, extended to support the control flow of Early-Exit networks, together with Design Space Exploration to optimize the generated streaming architecture, with the goal of increasing throughput and reducing area while maintaining accuracy.
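To illustrate the kind of finite-buffer effect the analytical model captures, the closed-form throughput of a single M/M/1/K queue is sketched below. This is a textbook stand-in chosen for brevity, not ATHEENA's actual model: it shows how a bounded buffer (here capacity K, including the item in service) blocks arrivals and caps effective throughput below the offered rate.

```python
def mm1k_throughput(lam, mu, K):
    """Effective throughput of an M/M/1/K queue.

    Arrivals at rate `lam` are dropped whenever the buffer (capacity
    K, including the item in service) is full; service rate is `mu`.
    Effective throughput = lam * (1 - P_block), where P_block is the
    steady-state probability the queue is full.
    """
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        # Degenerate case rho == 1: all K+1 states equally likely.
        p_block = 1.0 / (K + 1)
    else:
        p_block = (1 - rho) * rho**K / (1 - rho**(K + 1))
    return lam * (1 - p_block)
```

With a large buffer and utilisation below 1, throughput approaches the arrival rate; shrinking the buffer raises the blocking probability and lowers throughput, which is why the size of on-chip BRAM buffers matters when predicting accelerator performance.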

The original ATHEENA results were calculated under a naïve, infinite-buffer assumption; the experimental results presented here, on three different networks, demonstrate a throughput increase of 2.00× to 2.78× compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can match the throughput of the same baseline with as little as 46% of the resources the baseline requires. We plan to explore buffer placement and to continue expanding the benchmarks to obtain improved results.

This talk is part of the CAS Talks series.
