SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores

Authors: Andreas Diavastos, Pedro Trancoso

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 3

Article No.: 31, Pages 1 - 23

https://doi.org/10.1145/3127068

Published: 06 September 2017

Abstract

SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, SWITCHES allows the granularity of loop-tasks to be increased to favor data locality, even when dependences exist across different loops. It introduces explicit mechanisms for allocating resources to tasks and adopts the latest OpenMP Application Programming Interface (API) to maintain high programming productivity. It provides a source-to-source tool that automatically produces thread-based code. On an Intel Xeon Phi, SWITCHES shows good scalability and surpasses OpenMP by an average of 32%.
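
As a rough illustration of the programming style the abstract refers to, the following is a minimal sketch of loop-tasks with dependences across different loops, written with standard OpenMP `task` and `depend` clauses. It is not SWITCHES code and not the output of its source-to-source tool. The function name `run_pipeline` and the size `N` are hypothetical, chosen only for this example.

```c
#include <assert.h>

#define N 8 /* hypothetical problem size, used only in this sketch */

/* Three loop-tasks chained by data dependences:
 * fill a[] -> derive b[] from a[] -> reduce b[] into sum.
 * The depend clauses let the runtime order the tasks; no explicit
 * barrier is placed between the loops. */
long run_pipeline(void)
{
    int a[N], b[N];
    long sum = 0;

    assert(N > 0);

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer loop-task: writes a[]. */
        #pragma omp task shared(a) depend(out: a)
        {
            for (int i = 0; i < N; i++)
                a[i] = i + 1;
        }

        /* Second loop-task: reads a[], writes b[]. */
        #pragma omp task shared(a, b) depend(in: a) depend(out: b)
        {
            for (int i = 0; i < N; i++)
                b[i] = 2 * a[i];
        }

        /* Consumer loop-task: reads b[], accumulates into sum. */
        #pragma omp task shared(b, sum) depend(in: b) depend(out: sum)
        {
            for (int i = 0; i < N; i++)
                sum += b[i];
        }

        /* Wait for all three tasks before leaving the single region. */
        #pragma omp taskwait
    }
    return sum;
}
```

Each loop is expressed as one coarse task, so the loop body runs as a unit on one thread, and the `depend` clauses carry the ordering between loops. If the file is compiled without OpenMP support, the pragmas are ignored and the three loops run sequentially with the same result.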

Supplementary Material

TACO1403-31 (taco1403-31.pdf): slide deck associated with this paper (2.56 MB)

Cited By

Wu Q, Li R, Beard J, John L, Rodríguez G, Sadayappan P, Sukumaran-Rajam A. (2024). BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 100-112. Online publication date: 17-Feb-2024.
https://dl.acm.org/doi/10.1145/3640537.3641568

Diavastos A, Trancoso P. (2017). Auto-tuning Static Schedules for Task Data-flow Applications. Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems, 1-6. Online publication date: 9-Sep-2017.
https://dl.acm.org/doi/10.1145/3152821.3152879

Index Terms

SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 3

September 2017

278 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3132652

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 September 2017

Accepted: 01 July 2017

Revised: 01 June 2017

Received: 01 April 2017

Published in TACO Volume 14, Issue 3

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 2
Total Downloads: 591
Downloads (Last 12 months): 69
Downloads (Last 6 weeks): 12

Reflects downloads up to 24 Oct 2024
