SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores

Published: 06 September 2017

Abstract

SWITCHES is a task-based dataflow runtime that implements a lightweight, distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, it allows the granularity of loop-tasks to be increased to favor data locality, even when dependences exist across different loops. SWITCHES introduces explicit task resource allocation mechanisms for efficient allocation of resources and adopts the latest OpenMP Application Programming Interface (API) to maintain high levels of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Evaluation on an Intel Xeon Phi shows good scalability, with performance surpassing OpenMP by an average of 32%.
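The combination of distributed triggering and static assignment described in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that each task owns a flag (a "switch") that it turns on when it finishes, and that consumers wait on their producers' flags instead of consulting a central scheduler. The names `Task` and `run_static_schedule` are hypothetical, and Python threads stand in for the thread-based code that the SWITCHES source-to-source tool generates; this is not the runtime's actual implementation.

```python
import threading


class Task:
    """A unit of work with its own switch (assumed model, not the real API)."""

    def __init__(self, name, work, producers=()):
        self.name = name
        self.work = work
        self.producers = tuple(producers)
        self.switch = threading.Event()  # stays off until the task completes


def run_static_schedule(assignment):
    """Run tasks under a fixed thread-to-task assignment.

    `assignment` holds one task list per worker thread. It is fixed before
    execution starts, standing in for a compile-time assignment policy.
    Returns the order in which tasks completed.
    """
    log = []
    log_lock = threading.Lock()

    def worker(tasks):
        for task in tasks:
            # Dependence resolution is local: wait on each producer's own
            # switch; there is no shared ready queue or central scheduler.
            for producer in task.producers:
                producer.switch.wait()
            task.work()
            with log_lock:
                log.append(task.name)
            task.switch.set()  # turning the switch on triggers consumers

    threads = [threading.Thread(target=worker, args=(tasks,))
               for tasks in assignment]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


# Tasks "a" and "b" are independent; "c" consumes both of their results.
results = {}
a = Task("a", lambda: results.update(a=2))
b = Task("b", lambda: results.update(b=3))
c = Task("c", lambda: results.update(c=results["a"] + results["b"]),
         producers=(a, b))

# Thread 0 runs a then c; thread 1 runs b.
order = run_static_schedule([[a, c], [b]])
```

In this sketch `c` always completes last, because it blocks until both producer switches are on, while `a` and `b` may finish in either order. The assignment must list each thread's tasks in an order consistent with the dependences, otherwise a thread could wait on a task queued behind it.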

Supplementary Material

TACO1403-31 (taco1403-31.pdf)
Slide deck associated with this paper

Cited By

  • (2024) BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 100-112. DOI: 10.1145/3640537.3641568. Online publication date: 17-Feb-2024
  • (2017) Auto-tuning Static Schedules for Task Data-flow Applications. Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems, 1-6. DOI: 10.1145/3152821.3152879. Online publication date: 9-Sep-2017

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 3
September 2017
278 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3132652

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 September 2017
Accepted: 01 July 2017
Revised: 01 June 2017
Received: 01 April 2017
Published in TACO Volume 14, Issue 3

Author Tags

  1. Many-core
  2. SWITCHES
  3. dataflow
  4. parallel programming
  5. runtime system
  6. tasks

Qualifiers

  • Research-article
  • Research
  • Refereed
