skip to main content
research-article

Learning to optimize halide with tree search and random programs

Published: 12 July 2019 Publication History
  • Get Citation Alerts
  • Abstract

    We present a new algorithm to automatically schedule Halide programs for high-performance image processing and deep learning. We significantly improve upon the performance of previous methods, which considered a limited subset of schedules. We define a parameterization of possible schedules much larger than prior methods and use a variant of beam search to search over it. The search optimizes runtime predicted by a cost model based on a combination of new derived features and machine learning. We train the cost model by generating and featurizing hundreds of thousands of random programs and schedules. We show that this approach operates effectively with or without autotuning. It produces schedules which are on average almost twice as fast as the existing Halide autoscheduler without autotuning, or more than twice as fast with, and is the first automatic scheduling algorithm to significantly outperform human experts on average.

    Supplementary Material

    ZIP File (a121-adams.zip)
    Supplemental material
    MP4 File (papers_239.mp4)

    References

    [1]
    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR (2016). http://arxiv.org/abs/1603.04467
    [2]
    Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM.
    [3]
    Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A Survey on Compiler Autotuning Using Machine Learning. ACM Comput. Surv. 51, 5 (Sept. 2018).
    [4]
    Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Ketema, Javed Absar, Sven van Haastregt, Alexey Kravets, Anton Lokhmotov, Robert David, and Elnar Hajiyev. 2015. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (PACT '15). IEEE Computer Society.
    [5]
    Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM.
    [6]
    Jiawen Chen, Andrew Adams, Neal Wadhwa, and Samuel W. Hasinof. 2016. Bilateral Guided Upsampling. ACM Trans. Graph. 35, 6 (Nov. 2016).
    [7]
    Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR (2015). http://arxiv.org/abs/1512.01274
    [8]
    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: End-to-End Optimization Stack for Deep Learning. CoRR (2018). http://arxiv.org/abs/1802.04799
    [9]
    Tianqi Chen, Lianmin Zheng, Eddie Q. Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to Optimize Tensor Programs. CoRR (2018). http://arxiv.org/abs/1805.08166
    [10]
    Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end deep learning of optimization heuristics. In Parallel Architectures and Compilation Techniques (PACT), 2017 26th International Conference on. IEEE.
    [11]
    Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Edwin Bonilla, John Thomson, Hugh Leather, et al. 2008. MILEPOST GCC: machine learning based research compiler. In GCC summit.
    [12]
    Tobias Grosser, Armin Groslinger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letters 22, 4 (2012).
    [13]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    [14]
    James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. 2014. Darkroom: Compiling High-level Image Processing Code into Hardware Pipelines. ACM Trans. Graph. (Proc. SIGGRAPH) 33, 4 (July 2014).
    [15]
    Intel. 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/intel/mkl-dnn
    [16]
    Abhinav Jangda and Uday Bondhugula. 2018. An effective fusion and tile size model for optimizing image processing pipelines. In Symposium on Principles and Practice of Parallel Programming.
    [17]
    Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
    [18]
    Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Frédo Durand, and Jonathan Ragan-Kelley. 2018. Differentiable Programming for Image Processing and Deep Learning in Halide. ACM Trans. Graph. 37, 4 (July 2018).
    [19]
    Charith Mendis, Saman Amarasinghe, and Michael Carbin. 2018. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. ArXiv e-prints (Aug. 2018). arXiv:cs.DC/1808.07412 https://arxiv.org/pdf/1808.07412.pdf
    [20]
    Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4 (July 2016).
    [21]
    Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic Optimization for Image Processing Pipelines. SIGARCH Comput. Archit. News 43, 1 (March 2015).
    [22]
    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
    [23]
    Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines. ACM Trans. Graph. 31, 4 (July 2012).
    [24]
    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. SIGPLAN Not. 48, 6 (June 2013).
    [25]
    Mohammed Rahman, Louis-Noël Pouchet, and P Sadayappan. 2010. Neural network assisted tile size selection. In International Workshop on Automatic Performance Tuning (IWAPT'2010). Berkeley, CA: Springer Verlag.
    [26]
    D. Raj. Reddy. 1977. Speech Understanding Systems: A Summary of Results of the Five-Year Research Effort. Department of Computer Science Technical Report. Carnegie Mellon University.
    [27]
    Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers. 2018. Loop Transformations Leveraging Hardware Prefetching. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM.
    [28]
    Savvas Sioutas, Sander Stuijk, Luc Waeijen, Twan Basten, Henk Corporaal, and Lou Somers. 2019. Schedule Synthesis for Halide Pipelines Through Reuse Analysis. ACM Trans. Archit. Code Optim. 16, 2 (April 2019).
    [29]
    Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zach DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR (2018). http://arxiv.org/abs/1802.04730

    Cited By

    View all
    • (2024)Cross-Feature Transfer Learning for Efficient Tensor Program GenerationApplied Sciences10.3390/app1402051314:2(513)Online publication date: 6-Jan-2024
    • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
    • (2024)Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN DeploymentProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656631(498-510)Online publication date: 30-May-2024
    • Show More Cited By

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Graphics
    ACM Transactions on Graphics  Volume 38, Issue 4
    August 2019
    1480 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3306346
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 July 2019
    Published in TOG Volume 38, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. halide
    2. optimizing compilers

    Qualifiers

    • Research-article

    Funding Sources

    • Toyota
    • NSF/Intel Partnership on Computer Assisted Programming for Heterogeneous Architectures

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)256
    • Downloads (Last 6 weeks)20
    Reflects downloads up to 14 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Cross-Feature Transfer Learning for Efficient Tensor Program GenerationApplied Sciences10.3390/app1402051314:2(513)Online publication date: 6-Jan-2024
    • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
    • (2024)Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN DeploymentProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656631(498-510)Online publication date: 30-May-2024
    • (2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
    • (2024)Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power DevicesProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641564(198-211)Online publication date: 17-Feb-2024
    • (2024)Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine RelationsACM Transactions on Computer Systems10.1145/363530541:1-4(1-45)Online publication date: 15-Jan-2024
    • (2024)Guided Equality SaturationProceedings of the ACM on Programming Languages10.1145/36329008:POPL(1727-1758)Online publication date: 5-Jan-2024
    • (2024)CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor ProgramsProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3629572(1054-1074)Online publication date: 22-Apr-2024
    • (2024)Cyclebite: Extracting Task Graphs From Unstructured Compute-ProgramsIEEE Transactions on Computers10.1109/TC.2023.332750473:1(221-234)Online publication date: 1-Jan-2024
    • (2024)Soter: Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00076(991-1004)Online publication date: 29-Jun-2024
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media