ARC '12 practice

If you have a question about this talk, please contact Grigorios Mingas.

Abstract (G. Mingas): Markov Chain Monte Carlo (MCMC) is a family of algorithms used to draw samples from arbitrary probability distributions in order to estimate otherwise intractable integrals. When the distribution is complex, simple MCMC becomes inefficient and advanced variants are employed. This paper proposes a novel FPGA architecture to accelerate Parallel Tempering, a popular but computationally expensive MCMC method designed to sample from multimodal distributions. The proposed architecture can be used to sample from any distribution. Moreover, the work demonstrates that MCMC is robust to reductions in the arithmetic precision used to evaluate the sampling distribution, and this robustness is exploited to improve the FPGA's performance. A 1072x speedup over software and a 3.84x speedup over a GPGPU implementation are achieved when performing Bayesian inference for a mixture model, without any compromise in the quality of results, opening the way to handling previously intractable problems.
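For intuition, here is a minimal Python sketch of Parallel Tempering (a software illustration, not the paper's FPGA architecture): several Metropolis chains run at different temperatures and periodically propose state swaps, so hot chains move freely between modes while the cold chain yields the actual samples. The bimodal Gaussian-mixture target, temperature ladder, step size, and swap interval are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_target(x):
    # Illustrative bimodal target: equal mixture of N(-4, 1) and N(4, 1)
    # (up to a constant, which Metropolis ratios do not need).
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

def parallel_tempering(n_iters=20000, temps=(1.0, 2.0, 4.0, 8.0),
                       step=1.0, swap_every=10, seed=0):
    rng = np.random.default_rng(seed)
    betas = 1.0 / np.asarray(temps)          # inverse temperatures
    x = rng.normal(size=len(betas))          # one chain state per temperature
    samples = []
    for it in range(n_iters):
        # Local Metropolis update within each tempered chain:
        # chain i targets pi(x)^beta_i.
        for i, b in enumerate(betas):
            prop = x[i] + step * rng.normal()
            if np.log(rng.random()) < b * (log_target(prop) - log_target(x[i])):
                x[i] = prop
        # Periodically propose swaps between adjacent temperatures.
        if it % swap_every == 0:
            for i in range(len(betas) - 1):
                log_a = (betas[i] - betas[i + 1]) * \
                        (log_target(x[i + 1]) - log_target(x[i]))
                if np.log(rng.random()) < log_a:
                    x[i], x[i + 1] = x[i + 1], x[i]
        samples.append(x[0])                  # the T=1 chain samples pi itself
    return np.array(samples)

# Usage: samples drawn from the bimodal target should cover both modes.
# samples = parallel_tempering(); print(samples.mean(), samples.std())
```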

Abstract (A. Rafique): Iterative numerical algorithms with high memory bandwidth requirements but medium-size data sets (matrix size ∼ a few 100s) are highly appropriate for FPGA acceleration. This paper presents a streaming architecture comprising floating-point operators coupled with high-bandwidth on-chip memories for the Lanczos method, an iterative algorithm for symmetric eigenvalue computation. We show that the Lanczos method can be specialized to compute only extremal eigenvalues and present an architecture which achieves a sustained single-precision floating-point performance of 175 GFLOPs on a Virtex6-SX475T for a dense matrix of size 335×335. We perform a quantitative comparison with parallel implementations of the Lanczos method using the optimized Intel MKL and CUBLAS libraries for multi-core and GPU respectively. We find that for a range of matrices the FPGA implementation outperforms both the multi-core and the GPU: a speedup of 8.2-27.3× (13.4× geometric mean) over an Intel Xeon X5650 and 26.2-116× (52.8× geometric mean) over an Nvidia C2050 when the FPGA solves a single eigenvalue problem, and a speedup of 41-520× (103× geometric mean) and 131-2220× (408× geometric mean) respectively when it solves multiple eigenvalue problems.
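For reference, a minimal NumPy sketch of the plain Lanczos iteration follows (a software illustration, not the paper's streaming FPGA design): k matrix-vector products build a k×k tridiagonal matrix whose extreme Ritz values approximate the extremal eigenvalues of A. The iteration count and the omission of reorthogonalization are simplifying assumptions for brevity.

```python
import numpy as np

def lanczos_extremal(A, k=50, seed=0):
    """Approximate the smallest and largest eigenvalues of symmetric A
    with k steps of the Lanczos iteration (no reorthogonalization)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    q = rng.normal(size=n)
    q /= np.linalg.norm(q)                   # initial Lanczos vector
    q_prev = np.zeros(n)
    alpha = np.zeros(k)                      # diagonal of tridiagonal T
    beta = np.zeros(k - 1)                   # off-diagonal of T
    for j in range(k):
        w = A @ q                            # the one matrix-vector product
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j > 0:
            w -= beta[j - 1] * q_prev
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] == 0.0:               # invariant subspace found early
                alpha, beta = alpha[:j + 1], beta[:j]
                break
            q_prev, q = q, w / beta[j]
    # Extreme eigenvalues (Ritz values) of the small tridiagonal matrix
    # converge quickly to the extreme eigenvalues of A.
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    ritz = np.linalg.eigvalsh(T)
    return ritz[0], ritz[-1]

# Usage on a random symmetric matrix of the paper's scale:
# M = np.random.default_rng(1).normal(size=(335, 335)); A = (M + M.T) / 2
# print(lanczos_extremal(A))  # compare with np.linalg.eigvalsh(A)[[0, -1]]
```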

This talk is part of the CAS Talks series.
