A brief survey of HPC languages

Co-array Fortran (CAF)

Introduction

Co-Array Fortran (CAF) is a small set of extensions to Fortran 95 for Single Program Multiple Data (SPMD) parallel processing. It is a simple syntactic extension that converts Fortran 95 into a robust, efficient parallel language: it looks and feels like Fortran and requires Fortran programmers to learn only a few new rules.
A coarray Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran is extended with additional trailing subscripts in square brackets to provide a concise representation of references to data that is spread across images.
Coarrays are now part of the Fortran 2008 standard, although the standardized syntax differs slightly from the original CAF proposal.

Features

Coarray Fortran (CAF) is an SPMD parallel programming model based on a small set of language extensions to Fortran 90. CAF supports access to non-local data using a natural extension to Fortran 90 syntax, lightweight and flexible synchronization primitives, pointers, and dynamic allocation of shared data.
An executing CAF program consists of a static collection of asynchronous process images. Like MPI programs, CAF programs explicitly manage locality, data and computation distribution; however, CAF is a shared-memory programming model based on one-sided communication. Rather than explicitly coding message exchanges to obtain off-processor data, CAF programs can directly reference off-processor values using an extension of Fortran 90 syntax for subscripted references. Since both remote data access and synchronization are expressed in the language, communication and synchronization are amenable to compiler-based optimizing transformations.
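
As a concrete illustration, here is a minimal sketch using the Fortran 2008 coarray syntax (the program and variable names are illustrative): each image holds its own copy of x, and image 1 performs a one-sided read of the copy on image 2 through the trailing square-bracket subscript.

  program caf_sketch
    implicit none
    integer :: x[*]                    ! a scalar coarray: one copy per image
    x = this_image()                   ! each image stores its own index
    sync all                           ! barrier: all assignments are complete
    if (this_image() == 1 .and. num_images() > 1) then
      print *, 'image 1 sees x on image 2 =', x[2]   ! one-sided remote read
    end if
  end program caf_sketch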

Usage/Adoption

No significant production applications using CAF were found; it is used mainly for scientific codes and can be, and is, used in supercomputing.
Its main implementation has been provided by the Cray Fortran 90 compiler since release 3.1. Another implementation has been developed by the Los Alamos Computer Science Institute (LACSI) at Rice University, which is working on an open-source, portable, retargetable, high-quality Co-Array Fortran compiler suitable for use with production codes.

Additional Information

http://caf.rice.edu/index.html
http://en.wikipedia.org/wiki/Co-array_Fortran

Robert W. Numrich and John Reid. Co-Array Fortran for parallel programming. ACM SIGPLAN Fortran Forum Archive, 17:1–31, August 1998.
C. Coarfa, Y. Dotsenko, J. Eckhardt, and J. Mellor-Crummey. Co-array Fortran performance and potential: An NPB experimental study. In 16th International Workshop on Languages and Compilers for Parallel Processing (LCPC), October 2003.
John Mellor-Crummey, Laksono Adhianto, William Scherer III, and Guohua Jin, A New Vision for Coarray Fortran, Proceedings PGAS09, 2009

Unified Parallel C (UPC)

http://upc.gwu.edu/

Introduction

Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor.
In order to express parallelism, UPC extends ISO C 99 with the following constructs:

  • An explicitly parallel execution model
  • A shared address space
  • Synchronization primitives and a memory consistency model
  • Memory management primitives

Features

Under UPC, memory is composed of a shared memory space and a private memory space. A number of threads work independently; each of them can reference any address in the shared space, but only its own private space. The total number of threads is THREADS and each thread can identify itself using MYTHREAD, where THREADS and MYTHREAD can be seen as special constants. The shared space, however, is logically divided into partitions, each with a special association (affinity) to a given thread. The idea is that, with proper declarations, programmers can keep the shared data that will be processed predominantly by a given thread associated with that thread, so that a thread and the data that has affinity to it can likely be mapped by the system onto the same physical node.
Since UPC is an explicitly parallel extension of ISO C, all language features of C are already embodied in UPC. In addition, UPC declarations give the programmer control over the distribution of data across the threads, and UPC supports dynamic shared memory allocation. There is generally no implicit synchronization in UPC; the language therefore offers a rich range of synchronization and memory consistency control constructs.
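
The sketch below illustrates these ideas, assuming a UPC compiler such as upcc or GCC UPC and an array name and size chosen purely for illustration: a shared array is distributed across the threads, upc_forall assigns each iteration to the thread with affinity to the corresponding element, and upc_barrier supplies the synchronization the language does not provide implicitly.

  #include <upc_relaxed.h>   /* UPC header selecting the relaxed memory consistency model */
  #include <stdio.h>

  #define N 16
  shared int a[N];           /* cyclic by default: element i has affinity to thread i % THREADS */

  int main(void) {
      int i;
      upc_forall (i = 0; i < N; i++; &a[i])   /* each iteration runs on the owning thread */
          a[i] = i;
      upc_barrier;                            /* no implicit synchronization: make all writes visible */
      if (MYTHREAD == 0) {
          int sum = 0;
          for (i = 0; i < N; i++)
              sum += a[i];
          printf("sum computed by thread 0 of %d: %d\n", THREADS, sum);
      }
      return 0;
  }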

Usage/Adoption

Demos and some applications, mainly scientific, that use UPC can be found. Some demos are available at http://upc.lbl.gov/demos/ and some applications at http://www.upc.mtu.edu/applications.html .
UPC compilers are available from Cray, IBM, and HP, as well as from UC Berkeley and Michigan Tech. There is also GCC UPC, which extends the capabilities of the GNU GCC compiler.
License: Open-source (the exact license type varies for each implementation)

Additional Information

http://en.wikipedia.org/wiki/Unified_Parallel_C
http://upc.lbl.gov/
http://www.upc.mtu.edu/
http://gccupc.org/
http://www.alphaworks.ibm.com/tech/upccompiler
http://h21007.www2.hp.com/portal/site/dspp/menuitem.863c3e4cbcdc3f3515b49c108973a801/?ciid=c108e1c4dde02110e1c4dde02110275d6e10RCRD

W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and Language Specification. CCS-TR-99-157, IDA Center for Computing Sciences, 1999

Chapel

http://chapel.cray.com/

Introduction

Chapel is a new parallel programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS). Chapel is designed to improve the productivity of high-end computer users while also serving as a portable parallel programming model that can be used on commodity clusters or desktop multicore systems. Chapel strives to vastly improve the programmability of large-scale parallel computers while matching or beating the performance and portability of current programming models like MPI.

Features

Chapel supports a multithreaded execution model via high-level abstractions for data parallelism, task parallelism, concurrency, and nested parallelism. Chapel's locale type enables users to specify and reason about the placement of data and tasks on a target architecture in order to tune for locality. Chapel supports global-view data aggregates with user-defined implementations, permitting operations on distributed data structures to be expressed in a natural manner. In contrast to many previous higher-level parallel languages, Chapel is designed around a multiresolution philosophy, permitting users to initially write very abstract code and then incrementally add more detail until they are as close to the machine as their needs require. Chapel supports code reuse and rapid prototyping via object-oriented design, type inference, and features for generic programming.
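
A minimal sketch of these ideas in Chapel itself (the constant n, the array A, and the messages are illustrative choices, not taken from the text above): a data-parallel forall fills an array, and a task-parallel coforall launches one task per locale, using the on statement to control placement.

  config const n = 10;            // can be overridden on the command line: ./a.out --n=100

  var A: [1..n] real;
  forall i in 1..n do             // data parallelism over a range
    A[i] = i * 2.0;

  coforall loc in Locales do      // one task per locale (e.g. one per compute node)
    on loc do                     // run the task where that locale's memory lives
      writeln("hello from locale ", here.id);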

Usage/Adoption

Chapel is a new language and is not yet mature enough to be widely adopted. The Chapel compiler is still considered a prototype, i.e. it is of limited use in production environments.

License: BSD open-source license

Additional Information

http://en.wikipedia.org/wiki/Chapel_(programming_language)
http://www.prace-project.eu/documents/14_chapel_jg.pdf

D. Callahan, B. L. Chamberlain, and H. P. Zima. The Cascade high productivity language. In Proceedings of the Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 52–60. IEEE Computer Society, 2004
B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the Chapel language. In Int’l J. High Performance Comp. Apps., volume 21, pages 291–312, Thousand Oaks, CA, USA, 2007. Sage Publications, Inc
S. J. Deitz, B. L. Chamberlain, and M. B. Hribar. Chapel: Cascade High-Productivity Language. An Overview of the Chapel Parallel Programming Model. cug.org

X10

http://x10-lang.org/

Introduction

X10 is a new programming language being developed at IBM Research in collaboration with academic partners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the DARPA program on High Productivity Computer Systems.
X10 is a type-safe, parallel object-oriented language. It targets parallel systems with multi-core SMP nodes interconnected in scalable cluster configurations. A member of the Partitioned Global Address Space (PGAS) family of languages, X10 allows the programmer to explicitly manage locality via Places, lightweight activities embodied in async, constructs for termination detection (finish) and phased computation (clocks), and the manipulation of global arrays and data structures.

Features

X10 is designed specifically for parallel programming using the partitioned global address space (PGAS) model. A computation is divided among a set of places, each of which holds some data and hosts one or more activities that operate on those data. It supports a constrained type system for object-oriented programming, as well as user-defined primitive struct types, globally distributed arrays, and structured and unstructured parallelism. [2]
X10 uses the concept of parent and child relationships between activities to prevent the deadlock that can occur when two or more processes wait for each other to finish before they can complete. An activity may spawn one or more child activities, which may themselves have children. Children cannot wait for a parent to finish, but a parent can wait for a child using the finish construct.
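
The sketch below (written in the style of X10 2.x; the class name and message are illustrative) shows the constructs described above: finish makes the parent activity wait for all of its children, async spawns a child activity, and at (p) places that activity at place p.

  public class HelloPlaces {
      public static def main(args: Rail[String]) {
          finish {                         // parent waits here for every child activity
              for (p in Place.places()) {
                  at (p) async {           // spawn a child activity at place p
                      Console.OUT.println("hello from " + here);
                  }
              }
          }
      }
  }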

Usage/Adoption

The language is new and is still evolving. Its previous implementation was described as experimental.

License: Eclipse Public License

Additional Information

http://en.wikipedia.org/wiki/X10_(programming_language)
http://www.cs.purdue.edu/homes/xinb/cx10/CX10Report/
http://www.prace-project.eu/documents/15_x10_wl.pdf

Charles, P., Donawa, C., Ebcioglu, K., Grothoff, C., Kielstra, A., Sarkar, V., and Praun, C. V. X10: An object-oriented approach to non-uniform cluster computing. In Object-Oriented Programming, Systems, Languages & Applications (OOPSLA) (Oct. 2005), pp. 519–538
Vijay Saraswat et al. The X10 language specification. Technical report, IBM T.J. Watson Research Center, 2010

HMPP

http://www.caps-entreprise.com
http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36

HMPP allows rapid development of GPU-accelerated applications. It is a workbench offering a high-level abstraction for hybrid programming based on C and Fortran directives. It includes:
  • A C and Fortran compiler,
  • Data-parallel backends for NVIDIA CUDA and OpenCL, and
  • A runtime that makes use of the CUDA / OpenCL development tools and drivers and ensures application deployment on multi-GPU systems.

Software assets are kept independent from both hardware platforms and commercial software. By providing different target versions of the computations that are offloaded to the available hardware compute units, an HMPP application dynamically adapts its execution to multi-GPU systems and platform configurations, guaranteeing scalability and interoperability.

HMPP Workbench is based on OpenMP-like directive extensions for C and Fortran, used to build hardware-accelerated variants of functions to be offloaded to hardware accelerators such as NVIDIA Tesla (or any CUDA-compatible hardware) and AMD FireStream. HMPP allows users to pipeline computations on multi-GPU systems and makes better use of asynchronous hardware features to build even better-performing GPU-accelerated applications.

With the HMPP target generators one can rapidly prototype and evaluate the performance of the hardware-accelerated critical functions. HMPP code is considered to be efficient, portable, and easy to develop and maintain.

HMPP uses paired codelet/callsite directives: codelet for the routine implementation and callsite for the routine invocation. Unique labels are used to tie them together.
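
A minimal sketch of that pairing follows (the directive spelling follows HMPP 2.x documentation and may differ between versions; the label vadd and the vector-add routine are illustrative): the codelet directive marks the function to be offloaded, and the callsite directive carrying the same label marks its invocation. Without HMPP, the pragmas are ignored and the code remains plain C.

  #include <stdio.h>

  #define N 1024

  #pragma hmpp vadd codelet, target=CUDA, args[c].io=out
  void vadd(int n, float a[N], float b[N], float c[N])
  {
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }

  int main(void)
  {
      static float a[N], b[N], c[N];
      for (int i = 0; i < N; i++) { a[i] = (float) i; b[i] = 2.0f * i; }

  #pragma hmpp vadd callsite
      vadd(N, a, b, c);

      printf("c[10] = %f\n", c[10]);   /* expect 30.000000 */
      return 0;
  }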

Supported platforms: GPUs, including NVIDIA Tesla and AMD/ATI FireStream
Supported compilers: Intel, GNU gcc, GNU gfortran, Open64, PGI, SUN
Supported operating systems: any x86_64 Linux with kernel 2.6, libc, and g++; Windows

Usage/Adoption

The HMPP directives have been designed and used for more than 2 years by major HPC leaders.
CAPS and PathScale (a provider of high-performance AMD64 and Intel64 compilers) have jointly started working on advancing the HMPP directives as a new open standard. They aim to deliver a new evolution of the General-Purpose computation on Graphics Processing Units (GPGPU) programming model.

License: Not free; commercial and educational licenses are available.

Additional Information

PRACE seminar: http://www.prace-project.eu/news/prace-hosted-a-seminar-on-cuda-and-hmpp
http://www.caps-entreprise.com/upload/ckfinder/userfiles/files/caps_hmpp_ds.pdf
http://www.hpcprojects.com/products/product_details.php?product_id=621
http://www.drdobbs.com/high-performance-computing/225701323;jsessionid=HEWGJCK1MESBBQE1GHOSKHWATMY32JVN
http://www.ichec.ie/research/hmpp_intro.pdf