An Open-source HLS Fully Parameterizable Matrix Multiplication Library for AMD FPGAs

Authors

  • Angelos Athanasiadis School of Electrical and Computer Engineering, Aristotle University of Thessaloniki
  • Nikolaos Tampouratzis Department of Industrial Engineering and Management, International Hellenic University
  • Ioannis Papaefstathiou School of Electrical and Computer Engineering, Aristotle University of Thessaloniki

Keywords:

High-Performance Computing, Neural Networks, Matrix Multiplication, AMD FPGA, Vitis

Abstract

One common characteristic of High-Performance Computing (HPC) and Cyber-Physical Systems (CPS) is their need for heterogeneous, energy-efficient solutions. In this work we present a library for FPGA-accelerated dense matrix multiplication that is flexible, open-source, written in purely synthesizable C, and has no dependencies on the actual hardware implementation tools. Our library is designed to support arbitrary array sizes and accuracy, making it a versatile and adaptable solution that meets the diverse computational requirements of applications all the way from CPS to HPC. Our approach efficiently exposes the flexibility and performance of FPGAs to both novice and expert developers, which is not the case with the black-box libraries provided by the FPGA manufacturers. It has been evaluated on a number of state-of-the-art AMD FPGAs; the end results demonstrate that the presented implementations achieve 9x, 34x, and 3x gains in energy efficiency when compared with embedded CPUs, high-end CPUs, and GPUs, respectively. Moreover, our solution matches or slightly outperforms the most advanced comparable FPGA-tailored approach while being much more flexible, designer-friendly, and library-independent.
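As a rough illustration of the kind of kernel such a library parameterizes, the sketch below shows a minimal synthesizable-C matrix-multiplication function in the Vitis HLS style. The function name, the `DATA_T`/`N`/`M`/`K` parameters, and the pragma placement are hypothetical illustrations, not the library's actual API; in a real parameterizable library these would be compile-time configuration knobs.

```c
#include <assert.h>

/* Hypothetical sketch of a parameterizable HLS matrix-multiply kernel.
   DATA_T, N, M and K stand in for the library's compile-time parameters
   (element type and matrix dimensions). The #pragma HLS directives are
   AMD Vitis HLS hints; an ordinary C compiler simply ignores them. */
#define N 16
#define M 16
#define K 16
typedef float DATA_T;

void matmul(const DATA_T A[N][K], const DATA_T B[K][M], DATA_T C[N][M])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
#pragma HLS PIPELINE II = 1
            /* One output element per pipelined iteration. */
            DATA_T acc = 0;
            for (int k = 0; k < K; k++) {
#pragma HLS UNROLL
                /* Unrolled dot product over the shared dimension. */
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```

Because the pragmas are inert in software, the same source can be functionally verified with a plain C compiler before synthesis, which is one reason a pure synthesizable-C library remains tool-independent.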

References

J. de Fine Licht, M. Besta, S. Meierhans, and T. Hoefler, “Transformations of high-level synthesis codes for high-performance computing,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 5, pp. 1014–1029, May 2021.

A. Ahmad and M. Pasha, “Optimizing hardware accelerated general matrix-matrix multiplication for CNN on FPGAs,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 67, pp. 1–1, 2020.

S. Wu, Y. Zhai, J. Liu, J. Huang, Z. Jian, B. Wong, and Z. Chen, “Anatomy of high-performance GEMM with online fault tolerance on GPUs,” in Proceedings of the 37th International Conference on Supercomputing, ser. ICS ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 360–372.

R. Wang, Z. Yang, H. Xu, and L. Lu, “A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution,” The Journal of Supercomputing, vol. 78, no. 2, pp. 1741–1758, Jun 2021.

P. Haghi, A. Guo, T. Geng, J. Broaddus, D. Schafer, A. Skjellum, and M. Herbordt, “A reconfigurable compute-in-the-network FPGA assistant for high-level collective support with distributed matrix multiply: case study,” in IEEE International Conference on Field-Programmable Technology (FPT).

J. de Fine Licht, G. Kwasniewski, and T. Hoefler, “Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis,” in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 244–254.

E. H. D’Hollander, “High-level synthesis optimization for blocked floating-point matrix multiplication,” SIGARCH Comput. Archit. News, vol. 44, no. 4, pp. 74–79, Jan 2017.

Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, “FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates,” in IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), Apr. 2017, pp. 152–159.

Published

2024-08-20

How to Cite

Athanasiadis, A., Tampouratzis, N., & Papaefstathiou, I. (2024). An Open-source HLS Fully Parameterizable Matrix Multiplication Library for AMD FPGAs. WiPiEC Journal - Works in Progress in Embedded Computing Journal, 10(2). Retrieved from https://wipiec.digitalheritage.me/index.php/wipiecjournal/article/view/62