Automatically scheduling Halide image processing pipelines

Authors:

Ravi Teja Mullapudi,

Dillon Sharlet,

Jonathan Ragan-Kelley,

Kayvon Fatahalian

ACM Transactions on Graphics (TOG), Volume 35, Issue 4

Article No.: 83, Pages 1 - 11

https://doi.org/10.1145/2897824.2925952

Published: 11 July 2016

Abstract

The Halide image processing language has proven to be an effective system for authoring high-performance image processing code. Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine (a schedule), and the Halide compiler carries out the mechanical task of generating platform-specific code that implements the schedule. Unfortunately, designing high-performance schedules for complex image processing pipelines requires substantial knowledge of modern hardware architecture and code-optimization techniques. In this paper, we provide an algorithm for automatically generating high-performance schedules for Halide programs. Our solution extends the function bounds analysis already present in the Halide compiler to automatically perform locality- and parallelism-enhancing global program transformations typical of those employed by expert Halide developers. The algorithm does not require costly (and often impractical) auto-tuning, and, in seconds, generates schedules for a broad set of image processing benchmarks that are performance-competitive with, and often better than, schedules manually authored by expert Halide developers on server and mobile CPUs, as well as GPUs.
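The distinction between a pipeline and its schedule can be illustrated with a minimal sketch in plain Python. This is not Halide code and not the scheduling algorithm from the paper; the function names (`producer`, `schedule_root`, `schedule_tiled`) and the two-stage 1D pipeline are assumptions chosen for illustration. Both functions compute the same result, but they order the work differently: one materializes the whole intermediate stage, the other computes it per tile from the bounds the consumer requires.

```python
# Illustrative sketch only: plain Python, not Halide, and not the paper's algorithm.
# All names here (producer, schedule_root, schedule_tiled) are hypothetical.

def producer(src, x):
    """Stage 1: two-tap sum of the input at position x."""
    return src[x] + src[x + 1]

def schedule_root(src):
    """Compute the entire producer stage first, then the consumer stage."""
    n_out = len(src) - 2
    f = [producer(src, x) for x in range(n_out + 1)]
    out = [f[x] + f[x + 1] for x in range(n_out)]
    return out, len(f)  # output and number of producer evaluations

def schedule_tiled(src, tile):
    """Compute the consumer tile by tile, producing its inputs per tile."""
    n_out = len(src) - 2
    out, evals = [], 0
    for x0 in range(0, n_out, tile):
        x1 = min(x0 + tile, n_out)
        # Bounds: the consumer over [x0, x1) reads the producer over [x0, x1].
        f = [producer(src, x) for x in range(x0, x1 + 1)]
        evals += len(f)
        out.extend(f[i] + f[i + 1] for i in range(x1 - x0))
    return out, evals

src = list(range(18))
root_out, root_evals = schedule_root(src)
tiled_out, tiled_evals = schedule_tiled(src, tile=4)
assert root_out == tiled_out      # same algorithm, same result
print(root_evals, tiled_evals)    # prints "17 20"
```

In this toy case the tiled order evaluates the producer 20 times instead of 17, but its intermediate buffer holds 5 values rather than 17. Weighing that kind of redundant work against locality, across many stages at once, is the sort of decision a schedule encodes.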



Index Terms

Computing methodologies → Computer graphics → Graphics systems and interfaces

Published In

ACM Transactions on Graphics Volume 35, Issue 4

July 2016

1396 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/2897824

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States
