research-article

Learning to optimize halide with tree search and random programs

Authors:

Riyadh Baghdadi,

Michaël Gharbi,

Benoit Steiner,

Steven Johnson,

Kayvon Fatahalian,

Jonathan Ragan-KelleyAuthors Info & Claims

ACM Transactions on Graphics (TOG), Volume 38, Issue 4

Article No.: 121, Pages 1 - 12

https://doi.org/10.1145/3306346.3322967

Published: 12 July 2019 Publication History

Abstract

We present a new algorithm to automatically schedule Halide programs for high-performance image processing and deep learning. We significantly improve upon the performance of previous methods, which considered a limited subset of schedules. We define a parameterization of possible schedules much larger than prior methods and use a variant of beam search to search over it. The search optimizes runtime predicted by a cost model based on a combination of new derived features and machine learning. We train the cost model by generating and featurizing hundreds of thousands of random programs and schedules. We show that this approach operates effectively with or without autotuning. It produces schedules which are on average almost twice as fast as the existing Halide autoscheduler without autotuning, or more than twice as fast with, and is the first automatic scheduling algorithm to significantly outperform human experts on average.

Supplementary Material

ZIP File (a121-adams.zip)

Supplemental material

Download
29.26 MB

MP4 File (papers_239.mp4)

Download
177.78 MB

References

[1]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR (2016). http://arxiv.org/abs/1603.04467

[2]

Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM.

Digital Library

[3]

Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A Survey on Compiler Autotuning Using Machine Learning. ACM Comput. Surv. 51, 5 (Sept. 2018).

Digital Library

[4]

Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Ketema, Javed Absar, Sven van Haastregt, Alexey Kravets, Anton Lokhmotov, Robert David, and Elnar Hajiyev. 2015. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (PACT '15). IEEE Computer Society.

Digital Library

[5]

Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM.

Digital Library

[6]

Jiawen Chen, Andrew Adams, Neal Wadhwa, and Samuel W. Hasinof. 2016. Bilateral Guided Upsampling. ACM Trans. Graph. 35, 6 (Nov. 2016).

Digital Library

[7]

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR (2015). http://arxiv.org/abs/1512.01274

[8]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: End-to-End Optimization Stack for Deep Learning. CoRR (2018). http://arxiv.org/abs/1802.04799

[9]

Tianqi Chen, Lianmin Zheng, Eddie Q. Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to Optimize Tensor Programs. CoRR (2018). http://arxiv.org/abs/1805.08166

Digital Library

[10]

Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end deep learning of optimization heuristics. In Parallel Architectures and Compilation Techniques (PACT), 2017 26th International Conference on. IEEE.

[11]

Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Edwin Bonilla, John Thomson, Hugh Leather, et al. 2008. MILEPOST GCC: machine learning based research compiler. In GCC summit.

[12]

Tobias Grosser, Armin Groslinger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letters 22, 4 (2012).

[13]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]

James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. 2014. Darkroom: Compiling High-level Image Processing Code into Hardware Pipelines. ACM Trans. Graph. (Proc. SIGGRAPH) 33, 4 (July 2014).

Digital Library

[15]

Intel. 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/intel/mkl-dnn

[16]

Abhinav Jangda and Uday Bondhugula. 2018. An effective fusion and tile size model for optimizing image processing pipelines. In Symposium on Principles and Practice of Parallel Programming.

Digital Library

[17]

Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

[18]

Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Frédo Durand, and Jonathan Ragan-Kelley. 2018. Differentiable Programming for Image Processing and Deep Learning in Halide. ACM Trans. Graph. 37, 4 (July 2018).

Digital Library

[19]

Charith Mendis, Saman Amarasinghe, and Michael Carbin. 2018. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. ArXiv e-prints (Aug. 2018). arXiv:cs.DC/1808.07412 https://arxiv.org/pdf/1808.07412.pdf

[20]

Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4 (July 2016).

Digital Library

[21]

Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic Optimization for Image Processing Pipelines. SIGARCH Comput. Archit. News 43, 1 (March 2015).

Digital Library

[22]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

[23]

Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines. ACM Trans. Graph. 31, 4 (July 2012).

Digital Library

[24]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. SIGPLAN Not. 48, 6 (June 2013).

Digital Library

[25]

Mohammed Rahman, Louis-Noël Pouchet, and P Sadayappan. 2010. Neural network assisted tile size selection. In International Workshop on Automatic Performance Tuning (IWAPT'2010). Berkeley, CA: Springer Verlag.

[26]

D. Raj. Reddy. 1977. Speech Understanding Systems: A Summary of Results of the Five-Year Research Effort. Department of Computer Science Technical Report. Carnegie Mellon University.

[27]

Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers. 2018. Loop Transformations Leveraging Hardware Prefetching. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM.

Digital Library

[28]

Savvas Sioutas, Sander Stuijk, Luc Waeijen, Twan Basten, Henk Corporaal, and Lou Somers. 2019. Schedule Synthesis for Halide Pipelines Through Reuse Analysis. ACM Trans. Archit. Code Optim. 16, 2 (April 2019).

Digital Library

[29]

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zach DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR (2018). http://arxiv.org/abs/1802.04730

Cited By

Verma GRaskar SEmani MChapman B(2024)Cross-Feature Transfer Learning for Efficient Tensor Program GenerationApplied Sciences10.3390/app1402051314:2(513)Online publication date: 6-Jan-2024
https://doi.org/10.3390/app14020513
Pitchanathan AGrover KGrosser T(2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656452
Huang HChen XZhao J(2024)Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN DeploymentProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656631(498-510)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656631
Show More Cited By

Index Terms

Learning to optimize halide with tree search and random programs
1. Computing methodologies
  1. Computer graphics
    1. Image manipulation
      1. Image processing
2. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Domain specific languages

Recommendations

Automatically scheduling halide image processing pipelines

The Halide image processing language has proven to be an effective system for authoring high-performance image processing code. Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine (a ...
A Halide-based Synergistic Computing Framework for Heterogeneous Systems

New programming models have been developed to embrace contemporary heterogeneous machines, each of which may contain several types of processors, e.g., CPUs, GPUs, FPGAs and ASICs. Unlike the conventional ones, which use separate programming schemes for ...
Accelerating OpenVX Application Kernels Using Halide Scheduling
Abstract
In this study, we investigate how to use a Domain-Specific Language—Halide to accelerate and optimize OpenVX graphs. Halide is a new high-level image processing pipeline language. It offers developers to separate the program into algorithms and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Graphics

ACM Transactions on Graphics Volume 38, Issue 4

August 2019

1480 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/3306346

Editor:
Olga Sorkine-Hornung
ETH Zurich

Issue’s Table of Contents

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2019

Published in TOG Volume 38, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Toyota
NSF/Intel Partnership on Computer Assisted Programming for Heterogeneous Architectures

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

126
Total Citations
View Citations
1,397
Total Downloads

Downloads (Last 12 months)256
Downloads (Last 6 weeks)20

Reflects downloads up to 14 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Verma GRaskar SEmani MChapman B(2024)Cross-Feature Transfer Learning for Efficient Tensor Program GenerationApplied Sciences10.3390/app1402051314:2(513)Online publication date: 6-Jan-2024
https://doi.org/10.3390/app14020513
Pitchanathan AGrover KGrosser T(2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656452
Huang HChen XZhao J(2024)Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN DeploymentProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656631(498-510)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656631
Ranawaka PAzhar MStenstrom P(2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3649153.3649196
Maeng KLucia BRodríguez GSadayappan PSukumaran-Rajam A(2024)Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power DevicesProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641564(198-211)Online publication date: 17-Feb-2024
https://dl.acm.org/doi/10.1145/3640537.3641564
Zhao JXu JDi PNie WHu JYi YYang SGeng ZZhang RLi BGan ZJin X(2024)Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine RelationsACM Transactions on Computer Systems10.1145/363530541:1-4(1-45)Online publication date: 15-Jan-2024
https://dl.acm.org/doi/10.1145/3635305
Kœhler TGoens ABhat SGrosser TTrinder PSteuwer M(2024)Guided Equality SaturationProceedings of the ACM on Programming Languages10.1145/36329008:POPL(1727-1758)Online publication date: 5-Jan-2024
https://doi.org/10.1145/3632900
Hu HSu JZhao JPeng YZhu YLin HWu C(2024)CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor ProgramsProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3629572(1054-1074)Online publication date: 22-Apr-2024
https://dl.acm.org/doi/10.1145/3627703.3629572
Willis BShrivastava AMack JDave SChakrabarti CBrunhaver J(2024)Cyclebite: Extracting Task Graphs From Unstructured Compute-ProgramsIEEE Transactions on Computers10.1109/TC.2023.332750473:1(221-234)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TC.2023.3327504
Wang FShen MDing YXiao N(2024)Soter: Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00076(991-1004)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00076
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents