skip to main content
10.1145/2491956.2462176acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Published: 16 June 2013 Publication History
  • Get Citation Alerts
  • Abstract

    Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
    We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

    References

    [1]
    A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010.
    [2]
    J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009.
    [3]
    M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.
    [4]
    I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004.
    [5]
    J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007.
    [6]
    CoreImage. Apple CoreImage programming guide, 2006.
    [7]
    J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009.
    [8]
    C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.
    [9]
    K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006.
    [10]
    M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005.
    [11]
    M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.
    [12]
    P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.
    [13]
    J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012.
    [14]
    IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.
    [15]
    S. Kamil, C. Chan, L. Oliker, J. Shalf, and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.
    [16]
    S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007.
    [17]
    J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.
    [18]
    R. Moore. Interval Analysis. 1966.
    [19]
    A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010.
    [20]
    OpenMP. OpenMP. http://openmp.org/.
    [21]
    S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009.
    [22]
    S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011.
    [23]
    PixelBender. Adobe PixelBender reference, 2010.
    [24]
    H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991.
    [25]
    M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.
    [26]
    J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012.
    [27]
    M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994.
    [28]
    Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011.
    [29]
    W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002.
    [30]
    P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989.
    [31]
    X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.

    Cited By

    View all
    • (2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
    • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/3665643Online publication date: 22-May-2024
    • (2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
    • Show More Cited By

    Index Terms

    1. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2013
      546 pages
      ISBN:9781450320146
      DOI:10.1145/2491956
      • cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 48, Issue 6
        PLDI '13
        June 2013
        515 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2499370
        Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 June 2013

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. autotuning
      2. compiler
      3. domain specific language
      4. gpu
      5. image processing
      6. locality
      7. optimization
      8. parallelism
      9. redundant computation
      10. vectorization

      Qualifiers

      • Research-article

      Conference

      PLDI '13
      Sponsor:

      Acceptance Rates

      PLDI '13 Paper Acceptance Rate 46 of 267 submissions, 17%;
      Overall Acceptance Rate 406 of 2,067 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)571
      • Downloads (Last 6 weeks)56
      Reflects downloads up to 14 Aug 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
      • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/3665643Online publication date: 22-May-2024
      • (2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
      • (2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
      • (2024)NetBlocks: Staging Layouts for High-Performance Custom Host Network StacksProceedings of the ACM on Programming Languages10.1145/36563968:PLDI(467-491)Online publication date: 20-Jun-2024
      • (2024)A Verified Compiler for a Functional Tensor LanguageProceedings of the ACM on Programming Languages10.1145/36563908:PLDI(320-342)Online publication date: 20-Jun-2024
      • (2024)Interactive Source-to-Source Optimizations Validated using Static Resource AnalysisProceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis10.1145/3652588.3663320(26-34)Online publication date: 20-Jun-2024
      • (2024)Accelerated Auto-Tuning of GPU Kernels for Tensor ComputationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656626(549-561)Online publication date: 30-May-2024
      • (2024)Compiling Recurrences over Dense and Sparse ArraysProceedings of the ACM on Programming Languages10.1145/36498208:OOPSLA1(250-275)Online publication date: 29-Apr-2024
      • (2024)Dias: Dynamic Rewriting of Pandas CodeProceedings of the ACM on Management of Data10.1145/36393132:1(1-27)Online publication date: 26-Mar-2024
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media