research-article

General-Purpose Computing with Soft GPUs on FPGAs

Authors:

Muhammed Al Kadi,

Benedikt Janssen,

Michael Huebner

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 11, Issue 1

Article No.: 5, Pages 1 - 22

https://doi.org/10.1145/3173548

Published: 24 January 2018

Abstract

Using field-programmable gate arrays (FPGAs) as a substrate for soft graphics processing units (GPUs) would make FPGA compute power available through a flexible, GPU-like tool flow. Application-specific adaptations, such as selective hardening of floating-point operations and instruction set subsetting, would mitigate the high area and power demands of soft GPUs. This work explores the capabilities and limitations of soft General Purpose Computing on GPUs (GPGPU) for both fixed- and floating-point arithmetic. For this purpose, we have developed FGPU: a configurable, scalable, and portable GPU architecture designed especially for FPGAs. FGPU is open source and implemented entirely in RTL. It can be programmed in OpenCL and controlled through a Python API. This article introduces its hardware architecture as well as its tool flow. We evaluated the proposed GPGPU approach against multiple other solutions. In comparison to homogeneous Multi-Processor System-On-Chips (MPSoCs), we found that using a soft GPU is a Pareto-optimal solution regarding throughput per area and energy consumption. On average, FGPU has a 2.9× better compute density and 11.2× less energy consumption than a single MicroBlaze processor when computing in IEEE-754 floating-point format. An average speedup of about 4× over the ARM Cortex-A9 supported with the NEON vector co-processor has been measured for fixed- or floating-point benchmarks. In addition, the biggest FGPU cores we could implement on a Xilinx Zynq-7000 System-On-Chip (SoC) can deliver similar performance to equivalent implementations with High-Level Synthesis (HLS).
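
To make the Pareto-optimality claim in the abstract concrete, the following is a minimal Python sketch of how a Pareto front over two metrics (throughput per area, where higher is better, and energy consumption, where lower is better) could be computed. The `DesignPoint` type, the helper functions, and all numeric values are hypothetical and chosen only for illustration; they are not measurements or code from the article.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    """One candidate implementation (hypothetical, for illustration only)."""
    name: str
    throughput_per_area: float  # higher is better, arbitrary units
    energy: float               # lower is better, arbitrary units


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """Return True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.throughput_per_area >= b.throughput_per_area and a.energy <= b.energy
    strictly_better = a.throughput_per_area > b.throughput_per_area or a.energy < b.energy
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up values, NOT results from the article.
candidates = [
    DesignPoint("design_a", 1.0, 10.0),
    DesignPoint("design_b", 2.5, 4.0),
    DesignPoint("design_c", 2.0, 6.0),
    DesignPoint("design_d", 3.0, 5.0),
]

front_names = [p.name for p in pareto_front(candidates)]
print(front_names)
```

In this sketch a design is Pareto-optimal when no other candidate matches or beats it on both metrics at once, which is the sense in which the abstract compares the soft GPU with homogeneous MPSoCs.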

Information & Contributors

Information

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 11, Issue 1

Special Section on FCCM 2016 and Regular Papers

March 2018

183 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3178391

Editor:
Steve Wilton
Department of Electrical and Computer Engineering / University of British Columbia / Kaiser 4112, 5500-2332 Main Mall / Vancouver, BC V6T 1Z4 Canada

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2018

Accepted: 01 December 2017

Revised: 01 November 2017

Received: 01 June 2017

Published in TRETS Volume 11, Issue 1

Qualifiers

Research-article
Research
Refereed
