Schedule Synthesis for Halide Pipelines on GPUs

Published: 03 August 2020

Abstract

The Halide DSL and compiler have enabled high-performance code generation for image processing pipelines targeting heterogeneous architectures by separating the algorithmic description from the optimization schedule. However, automatic schedule generation is currently only possible for multi-core CPU architectures. As a result, expert knowledge is still required when optimizing for platforms with GPU capabilities. In this work, we extend the current Halide Autoscheduler with novel optimization passes to efficiently generate schedules for CUDA-based GPU architectures. We evaluate our proposed method across a variety of applications and show that it achieves performance competitive with, and in many cases better than, manually tuned Halide schedules. Experimental results show that our schedules are on average 10% faster than manual schedules and over 2× faster than previous autoscheduling attempts.
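The separation of algorithm and schedule mentioned in the abstract can be illustrated with a minimal sketch. It is plain Python, not Halide, and every name in it (`blur_x`, `schedule_row_major`, `schedule_tiled`, `realize`, the tile sizes) is a hypothetical chosen for illustration. The point it shows is only that the same per-pixel definition can be evaluated under different loop orders without changing the result; the tiled order is loosely analogous to assigning tiles to GPU thread blocks, and is not the scheduling method of the paper.

```python
# Illustrative sketch only: hypothetical names, not Halide's API.

def blur_x(img, x, y):
    """Algorithm: 3-tap horizontal box blur at (x, y), clamped at the borders."""
    w = len(img[0])
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    return (left + img[y][x] + right) / 3.0


def schedule_row_major(w, h):
    """Schedule A: visit every point in plain row-major order."""
    for y in range(h):
        for x in range(w):
            yield x, y


def schedule_tiled(w, h, tx=4, ty=2):
    """Schedule B: visit points tile by tile (tile sizes are assumptions)."""
    for y0 in range(0, h, ty):
        for x0 in range(0, w, tx):
            for y in range(y0, min(y0 + ty, h)):
                for x in range(x0, min(x0 + tx, w)):
                    yield x, y


def realize(algorithm, schedule, img):
    """Evaluate the algorithm at every point, in the order the schedule gives."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for x, y in schedule(w, h):
        out[y][x] = algorithm(img, x, y)
    return out


image = [[float(x + 10 * y) for x in range(7)] for y in range(5)]
result_a = realize(blur_x, schedule_row_major, image)
result_b = realize(blur_x, schedule_tiled, image)
assert result_a == result_b  # the schedule changes the order, not the values
```

In this sketch, choosing a schedule amounts to choosing which generator to pass to `realize`; an autoscheduler, in these terms, would pick the traversal and tile sizes automatically instead of leaving them to an expert.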

Published In

ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 3
September 2020
200 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3415154
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 August 2020
Accepted: 01 June 2020
Revised: 01 May 2020
Received: 01 November 2019
Published in TACO Volume 17, Issue 3

Author Tags

  1. GPU
  2. Halide
  3. Loop optimizations
  4. image processing
  5. scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed
