Schedule Synthesis for Halide Pipelines on GPUs

Authors:

Savvas Sioutas,

Henk Corporaal,

Lou Somers

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 3

Article No.: 23, Pages 1 - 25

https://doi.org/10.1145/3406117

Published: 03 August 2020

Abstract

The Halide DSL and compiler have enabled high-performance code generation for image processing pipelines targeting heterogeneous architectures through the separation of algorithmic description and optimization schedule. However, automatic schedule generation is currently only possible for multi-core CPU architectures. As a result, expert knowledge is still required when optimizing for platforms with GPU capabilities. In this work, we extend the current Halide Autoscheduler with novel optimization passes to efficiently generate schedules for CUDA-based GPU architectures. We evaluate our proposed method across a variety of applications and show that it can achieve performance competitive with that of manually tuned Halide schedules, or in many cases even better performance. Experimental results show that our schedules are on average 10% faster than manual schedules and over 2× faster than previous autoscheduling attempts.

Index Terms

Schedule Synthesis for Halide Pipelines on GPUs
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Domain specific languages

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 3

September 2020

200 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3415154

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 August 2020

Accepted: 01 June 2020

Revised: 01 May 2020

Received: 01 November 2019

Published in TACO Volume 17, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 9
Total Downloads: 2,539
Downloads (Last 12 months): 377
Downloads (Last 6 weeks): 38

Reflects downloads up to 24 Oct 2024

Citations

Cited By

Kœhler T, Goens A, Bhat S, Grosser T, Trinder P, Steuwer M (2024). Guided Equality Saturation. Proceedings of the ACM on Programming Languages 8:POPL (1727-1758). Online publication date: 5-Jan-2024
https://dl.acm.org/doi/10.1145/3632900
Sun W, Li A, Stuijk S, Corporaal H (2024). How Much Can We Gain From Tensor Kernel Fusion on GPUs? IEEE Access 12 (126135-126144). Online publication date: 2024
https://doi.org/10.1109/ACCESS.2024.3411473
Kanetaka Y, Takagi H, Maeda Y, Fukushima N (2024). SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide. IEEE Access 12 (7563-7583). Online publication date: 2024
https://doi.org/10.1109/ACCESS.2023.3345660
Barreiros W, Kong J, Ferreira R, Teodoro G (2024). A hierarchical data partitioning strategy for irregular applications: a case study in digital pathology. Cluster Computing 28:1. Online publication date: 17-Oct-2024
https://doi.org/10.1007/s10586-024-04728-5
Wang J, Suo Q, Yan D, Yue C (2023). CustomHalide – A new plugin of clang for loop optimization. Proceedings of the 2023 9th International Conference on Computing and Artificial Intelligence (557-567). Online publication date: 17-Mar-2023
https://dl.acm.org/doi/10.1145/3594315.3594372
Anderson L, Adams A, Ma K, Li T, Jin T, Ragan-Kelley J (2021). Efficient automatic scheduling of imaging and vision pipelines for the GPU. Proceedings of the ACM on Programming Languages 5:OOPSLA (1-28). Online publication date: 15-Oct-2021
https://dl.acm.org/doi/10.1145/3485486
Yu C, Shi X, Shen H, Chen Z, Li M, Wang Y (2021). Lorien. Proceedings of the ACM Symposium on Cloud Computing (18-32). Online publication date: 1-Nov-2021
https://dl.acm.org/doi/10.1145/3472883.3486973
Huff D, Dai S, Hanrahan P (2021). Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications on FPGAs. 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (186-194). Online publication date: May-2021
https://doi.org/10.1109/FCCM51124.2021.00030
Waeijen L, Sioutas S, Peemen M, Lindwer M, Corporaal H (2021). ConvFusion: A Model for Layer Fusion in Convolutional Neural Networks. IEEE Access 9 (168245-168267). Online publication date: 2021
https://doi.org/10.1109/ACCESS.2021.3134930
De S, Mohamed S, Goswami D, Corporaal H (2020). Approximation-Aware Design of an Image-Based Control System. IEEE Access 8 (174568-174586). Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3023047
