Enhancing Performance of Tall-Skinny QR Factorization using FPGAs

If you have a question about this talk, please contact Grigorios Mingas.

Communication-avoiding linear algebra algorithms with low communication-latency and high memory-bandwidth requirements, such as Tall-Skinny QR factorization (TSQR), are highly appropriate for acceleration using FPGAs. TSQR parallelizes QR factorization of tall-skinny matrices in a divide-and-conquer fashion by decomposing them into sub-matrices, performing local QR factorizations and then merging the intermediate results. As TSQR is a dense linear algebra problem, one might expect GPUs to perform better. However, GPU performance is limited by memory bandwidth in the local QR factorizations and by global communication latency in the merge stage. We exploit the shape of the matrix and propose an FPGA-based custom architecture that avoids these bottlenecks by using high-bandwidth on-chip memories for the local QR factorizations and by performing the merge stage entirely on-chip to reduce communication latency. We achieve a peak double-precision floating-point performance of 129 GFLOPs on a Virtex-6 SX475T. A quantitative comparison of our proposed design with recent QR factorizations on FPGAs and GPUs shows speedups of up to 7.7× and 12.7× respectively. Additionally, we show even higher performance over optimized linear algebra libraries such as Intel MKL for multi-cores, CULA for GPUs and MAGMA for hybrid systems.
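To illustrate the divide-and-conquer structure the abstract describes, here is a minimal one-level TSQR sketch in NumPy. It is not the talk's FPGA implementation; the block count and use of `numpy.linalg.qr` for both the local factorizations and the merge stage are illustrative choices.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """One-level TSQR sketch: split the tall-skinny matrix A row-wise,
    QR-factorize each block locally, then merge the stacked R factors
    with a second QR. The local factorizations are independent, which
    is what makes the algorithm easy to parallelize."""
    n = A.shape[1]
    blocks = np.array_split(A, num_blocks, axis=0)
    # Local QR factorizations (each block must have at least n rows)
    local = [np.linalg.qr(B) for B in blocks]
    Qs = [q for q, _ in local]
    # Merge stage: QR of the stacked n-by-n intermediate R factors
    stacked_R = np.vstack([r for _, r in local])
    Q2, R = np.linalg.qr(stacked_R)
    # Recover the full Q by combining each local Q with its slice of Q2
    Q = np.vstack([Qs[i] @ Q2[i * n:(i + 1) * n, :]
                   for i in range(num_blocks)])
    return Q, R

# Example: factorize a 1000x8 tall-skinny matrix
A = np.random.randn(1000, 8)
Q, R = tsqr(A)
print(np.allclose(Q @ R, A))  # the factorization reproduces A
```

A deeper recursion would merge the R factors pairwise in a tree, which is where the on-chip merge stage described above pays off by keeping the intermediate results out of off-chip memory.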

This talk is part of the CAS Talks series.
