© 2019 Cray Inc.
chapel-lang.org
Chapel Comes of Age:
A Language for Productivity,
Parallelism, and Performance
Brad Chamberlain
HPC Knowledge Meeting ’19
June 14, 2019
@ChapelLanguage
bradc@cray.com
© 2019 Cray Inc.
Education:
• Earned Ph.D. from University of Washington CSE in 2001
• focused on the ZPL data-parallel array language
• Remain associated with UW CSE as an Affiliate Professor
Industry:
• A Principal Engineer at Cray Inc.
• The technical lead / a founding member of the Chapel project
Who am I?
2
© 2019 Cray Inc.
What is Chapel?
Chapel: A modern parallel programming language
• portable & scalable
• open-source & collaborative
Goals:
• Support general parallel programming
• “any parallel algorithm on any parallel hardware”
• Make parallel programming at scale far more productive
3
© 2019 Cray Inc.
What does “Productivity” mean to you?
4
Recent Graduates:
“something similar to what I used in school: Python, Matlab, Java, …”
Seasoned HPC Programmers:
“that sugary stuff that I don’t need because I was born to suffer /
…because I want full control to ensure performance”
Computational Scientists:
“something that lets me express my parallel computations without having to wrestle
with architecture-specific details”
Chapel Team:
“something that lets computational scientists express what they want,
without taking away the control that HPC programmers want,
implemented in a language as attractive as recent graduates want.”
© 2019 Cray Inc. 5
[Source: Kathy Yelick,
CHIUW 2018 keynote:
Why Languages Matter
More Than Ever]
© 2019 Cray Inc.
✓ Context and Motivation
➤ Chapel and Productivity
• A Brief Tour of Chapel Features
• Arkouda: NumPy over Chapel
• Summary and Resources
Outline
© 2019 Cray Inc.
Chapel aims to be as…
…programmable as Python
…fast as Fortran
…scalable as MPI, SHMEM, or UPC
…portable as C
…flexible as C++
…fun as [your favorite programming language]
Comparing Chapel to Other Languages
7
© 2019 Cray Inc.
Given: m-element vectors A, B, C
Compute: ∀i ∈ 1..m, Aᵢ = Bᵢ + α⋅Cᵢ
In pictures:
STREAM Triad: a trivial parallel computation
8
[Figure: A = B + α · C]
© 2019 Cray Inc.
Given: m-element vectors A, B, C
Compute: ∀i ∈ 1..m, Aᵢ = Bᵢ + α⋅Cᵢ
In pictures, in parallel (shared memory / multicore):
STREAM Triad: a trivial parallel computation
9
[Figure: A = B + α · C, with the work divided across cores]
© 2019 Cray Inc.
Given: m-element vectors A, B, C
Compute: ∀i ∈ 1..m, Aᵢ = Bᵢ + α⋅Cᵢ
In pictures, in parallel (distributed memory):
STREAM Triad: a trivial parallel computation
10
[Figure: A = B + α · C, with A, B, and C distributed across compute nodes]
© 2019 Cray Inc.
Given: m-element vectors A, B, C
Compute: ∀i ∈ 1..m, Aᵢ = Bᵢ + α⋅Cᵢ
In pictures, in parallel (distributed memory multicore):
STREAM Triad: a trivial parallel computation
11
[Figure: A = B + α · C, distributed across nodes and computed in parallel across each node’s cores]
© 2019 Cray Inc.
#include <hpcc.h>
#ifdef _OPENMP
#include <omp.h>
#endif
static int VectorSize;
static double *a, *b, *c;
int HPCC_StarStream(HPCC_Params *params) {
int myRank, commSize;
int rv, errCount;
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm_size( comm, &commSize );
MPI_Comm_rank( comm, &myRank );
rv = HPCC_Stream( params, 0 == myRank);
MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm );
return errCount;
}
int HPCC_Stream(HPCC_Params *params, int doIO) {
register int j;
double scalar;
VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 );
a = HPCC_XMALLOC( double, VectorSize );
b = HPCC_XMALLOC( double, VectorSize );
c = HPCC_XMALLOC( double, VectorSize );
if (!a || !b || !c) {
if (c) HPCC_free(c);
if (b) HPCC_free(b);
if (a) HPCC_free(a);
if (doIO) {
fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize );
fclose( outFile );
}
return 1;
}
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (j=0; j<VectorSize; j++) {
b[j] = 2.0;
c[j] = 1.0;
}
scalar = 3.0;
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (j=0; j<VectorSize; j++)
a[j] = b[j]+scalar*c[j];
HPCC_free(c);
HPCC_free(b);
HPCC_free(a);
return 0;
}
STREAM Triad: C + MPI
12
© 2019 Cray Inc.
#include <hpcc.h>
#ifdef _OPENMP
#include <omp.h>
#endif
static int VectorSize;
static double *a, *b, *c;
int HPCC_StarStream(HPCC_Params *params) {
int myRank, commSize;
int rv, errCount;
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm_size( comm, &commSize );
MPI_Comm_rank( comm, &myRank );
rv = HPCC_Stream( params, 0 == myRank);
MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm );
return errCount;
}
int HPCC_Stream(HPCC_Params *params, int doIO) {
register int j;
double scalar;
VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 );
a = HPCC_XMALLOC( double, VectorSize );
b = HPCC_XMALLOC( double, VectorSize );
c = HPCC_XMALLOC( double, VectorSize );
if (!a || !b || !c) {
if (c) HPCC_free(c);
if (b) HPCC_free(b);
if (a) HPCC_free(a);
if (doIO) {
fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize );
fclose( outFile );
}
return 1;
}
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (j=0; j<VectorSize; j++) {
b[j] = 2.0;
c[j] = 1.0;
}
scalar = 3.0;
#ifdef _OPENMP
#pragma omp parallel for
#endif
for (j=0; j<VectorSize; j++)
a[j] = b[j]+scalar*c[j];
HPCC_free(c);
HPCC_free(b);
HPCC_free(a);
return 0;
}
STREAM Triad: C + MPI + OpenMP
13
© 2019 Cray Inc.
[The C + MPI + OpenMP code from the previous slide, shown alongside for comparison]
STREAM Triad: Chapel
14
use …;
config const m = 1000,
alpha = 3.0;
const ProblemSpace = {1..m} dmapped …;
var A, B, C: [ProblemSpace] real;
B = 2.0;
C = 1.0;
A = B + alpha * C;
The special sauce:
How should this index
set—and the arrays and
computations over it—be
mapped to the system?
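For reference, a minimal, runnable version of this code, assuming the standard Block distribution is used to fill in the elided "use …" and "dmapped …" (any other domain map could be substituted):

use BlockDist;

config const m = 1000,
             alpha = 3.0;

// Assumption: a Block distribution; swapping in Cyclic or another domain map
// changes how ProblemSpace maps to the system, not the computation below.
const ProblemSpace = {1..m} dmapped Block(boundingBox = {1..m});

var A, B, C: [ProblemSpace] real;

B = 2.0;
C = 1.0;

A = B + alpha * C;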
© 2019 Cray Inc.
HPCC STREAM Triad: Chapel vs. C+MPI+OpenMP
15
[Performance chart comparing Chapel with C+MPI+OpenMP on HPCC STREAM Triad; the “better” arrow marks the favorable direction]
© 2019 Cray Inc.
Data Structure: distributed table
Computation: update random table locations in parallel
Two variations:
• lossless: don’t allow any updates to be lost
• lossy: permit some fraction of updates to be lost
HPCC Random Access (RA)
16
© 2019 Cray Inc.
HPCC RA: MPI kernel
18
/* Perform updates to main table. The scalar equivalent is:
*
* for (i=0; i<NUPDATE; i++) {
* Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0);
* Table[Ran & (TABSIZE-1)] ^= Ran;
* }
*/
MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64,
MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq);
while (i < SendCnt) {
/* receive messages */
do {
MPI_Test(&inreq, &have_done, &status);
if (have_done) {
if (status.MPI_TAG == UPDATE_TAG) {
MPI_Get_count(&status, tparams.dtype64, &recvUpdates);
bufferBase = 0;
for (j=0; j < recvUpdates; j ++) {
inmsg = LocalRecvBuffer[bufferBase+j];
LocalOffset = (inmsg & (tparams.TableSize - 1)) -
tparams.GlobalStartMyProc;
HPCC_Table[LocalOffset] ^= inmsg;
}
} else if (status.MPI_TAG == FINISHED_TAG) {
NumberReceiving--;
} else
MPI_Abort( MPI_COMM_WORLD, -1 );
MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64,
MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq);
}
} while (have_done && NumberReceiving > 0);
if (pendingUpdates < maxPendingUpdates) {
Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B);
GlobalOffset = Ran & (tparams.TableSize-1);
if ( GlobalOffset < tparams.Top)
WhichPe = ( GlobalOffset / (tparams.MinLocalTableSize + 1) );
else
WhichPe = ( (GlobalOffset - tparams.Remainder) /
tparams.MinLocalTableSize );
if (WhichPe == tparams.MyProc) {
LocalOffset = (Ran & (tparams.TableSize - 1)) -
tparams.GlobalStartMyProc;
HPCC_Table[LocalOffset] ^= Ran;
} else {
HPCC_InsertUpdate(Ran, WhichPe, Buckets);
pendingUpdates++;
}
i++;
}
else {
MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
if (have_done) {
outreq = MPI_REQUEST_NULL;
pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize,
&peUpdates);
MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe,
UPDATE_TAG, MPI_COMM_WORLD, &outreq);
pendingUpdates -= peUpdates;
}
}
}
/* send remaining updates in buckets */
while (pendingUpdates > 0) {
/* receive messages */
do {
MPI_Test(&inreq, &have_done, &status);
if (have_done) {
if (status.MPI_TAG == UPDATE_TAG) {
MPI_Get_count(&status, tparams.dtype64, &recvUpdates);
bufferBase = 0;
for (j=0; j < recvUpdates; j ++) {
inmsg = LocalRecvBuffer[bufferBase+j];
LocalOffset = (inmsg & (tparams.TableSize - 1)) -
tparams.GlobalStartMyProc;
HPCC_Table[LocalOffset] ^= inmsg;
}
} else if (status.MPI_TAG == FINISHED_TAG) {
/* we got a done message. Thanks for playing... */
NumberReceiving--;
} else {
MPI_Abort( MPI_COMM_WORLD, -1 );
}
MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64,
MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq);
}
} while (have_done && NumberReceiving > 0);
MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
if (have_done) {
outreq = MPI_REQUEST_NULL;
pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize,
&peUpdates);
MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe,
UPDATE_TAG, MPI_COMM_WORLD, &outreq);
pendingUpdates -= peUpdates;
}
}
/* send our done messages */
for (proc_count = 0 ; proc_count < tparams.NumProcs ; ++proc_count) {
if (proc_count == tparams.MyProc) { tparams.finish_req[tparams.MyProc] =
MPI_REQUEST_NULL; continue; }
/* send garbage - who cares, no one will look at it */
MPI_Isend(&Ran, 0, tparams.dtype64, proc_count, FINISHED_TAG,
MPI_COMM_WORLD, tparams.finish_req + proc_count);
}
/* Finish everyone else up... */
while (NumberReceiving > 0) {
MPI_Wait(&inreq, &status);
if (status.MPI_TAG == UPDATE_TAG) {
MPI_Get_count(&status, tparams.dtype64, &recvUpdates);
bufferBase = 0;
for (j=0; j < recvUpdates; j ++) {
inmsg = LocalRecvBuffer[bufferBase+j];
LocalOffset = (inmsg & (tparams.TableSize - 1)) -
tparams.GlobalStartMyProc;
HPCC_Table[LocalOffset] ^= inmsg;
}
} else if (status.MPI_TAG == FINISHED_TAG) {
/* we got a done message. Thanks for playing... */
NumberReceiving--;
} else {
MPI_Abort( MPI_COMM_WORLD, -1 );
}
MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64,
MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq);
}
MPI_Waitall( tparams.NumProcs, tparams.finish_req, tparams.finish_statuses);
© 2019 Cray Inc.
HPCC RA: MPI kernel comment vs. Chapel
19
/* Perform updates to main table. The scalar equivalent is:
*
* for (i=0; i<NUPDATE; i++) {
* Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0);
* Table[Ran & (TABSIZE-1)] ^= Ran;
* }
*/
MPI Comment
forall (_, r) in zip(Updates, RAStream()) do
T[r & indexMask].xor(r);
Chapel Kernel
© 2019 Cray Inc.
HPCC RA: Chapel vs. C+MPI
20
[Chart: RA Performance (GUPS) vs. Locales (x 36 cores / locale), from 16 to 256 locales, comparing Chapel 1.19 with MPI (bucketing); higher is better]
© 2019 Cray Inc.
HPCC RA: MPI vs. Chapel
21
[The full C + MPI kernel from slide 18, repeated here for visual comparison]
forall (_, r) in zip(Updates, RAStream()) do
T[r & indexMask].xor(r);
Chapel Kernel
© 2019 Cray Inc. 23
[Source: Kathy Yelick,
CHIUW 2018 keynote:
Why Languages Matter
More Than Ever]
© 2019 Cray Inc.
HPC Patterns: Chapel vs. Reference
24
[Collage of performance-chart thumbnails comparing Chapel against reference implementations:
• LCALS: Chapel vs. Reference (serial and parallel kernels, long; normalized to the reference version; Chapel 1.17.1)
• HPCC STREAM Triad: Chapel vs. Reference
• PRK Stencil: Chapel vs. Reference
• HPCC RA: Chapel vs. C+MPI
• ISx: Chapel vs. Reference
Pattern categories: local loop kernels; embarrassing/pleasing parallelism; stencil boundary exchanges; global random updates; bucket-exchange pattern]
Nightly performance tickers online at:
https://chapel-lang.org/perf-nightly.html
© 2019 Cray Inc.
A Brief Tour of
Chapel Features
26
© 2019 Cray Inc.
Domain Maps
Data Parallelism
Task Parallelism
Base Language
Target Machine
Locality Control
Chapel language concepts
27
Chapel Feature Areas
© 2019 Cray Inc. 28
Base Language
Domain Maps
Data Parallelism
Task Parallelism
Base Language
Target Machine
Locality Control
Lower-level Chapel
© 2019 Cray Inc. 29
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for f in fib(n) do
writeln(f);
0
1
1
2
3
5
8
…
© 2019 Cray Inc. 30
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for f in fib(n) do
writeln(f);
0
1
1
2
3
5
8
…
Configuration declarations
(support command-line overrides)
./fib --n=1000000
© 2019 Cray Inc. 31
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for f in fib(n) do
writeln(f);
0
1
1
2
3
5
8
…
Iterators
(CLU-style / modern iterators)
© 2019 Cray Inc. 32
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for f in fib(n) do
writeln(f);
0
1
1
2
3
5
8
…
Static type inference for:
• arguments
• return types
• variables
© 2019 Cray Inc. 33
Base Language Features, by example
0
1
1
2
3
5
8
…
iter fib(n: int): int {
var current: int = 0,
next: int = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
Static type inference for:
• arguments
• return types
• variables
config const n: int = 10;
for f in fib(n) do
writeln(f);
Explicit types also
supported
© 2019 Cray Inc. 34
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for f in fib(n) do
writeln(f);
0
1
1
2
3
5
8
…
© 2019 Cray Inc. 35
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for (i,f) in zip(0..#n, fib(n)) do
writeln("fib #", i, " is ", f);
fib #0 is 0
fib #1 is 1
fib #2 is 1
fib #3 is 2
fib #4 is 3
fib #5 is 5
fib #6 is 8
…
Zippered iteration
© 2019 Cray Inc. 36
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for (i,f) in zip(0..#n, fib(n)) do
writeln("fib #", i, " is ", f);
fib #0 is 0
fib #1 is 1
fib #2 is 1
fib #3 is 2
fib #4 is 3
fib #5 is 5
fib #6 is 8
…
Range types and operators
© 2019 Cray Inc. 37
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for (i,f) in zip(0..#n, fib(n)) do
writeln("fib #", i, " is ", f);
fib #0 is 0
fib #1 is 1
fib #2 is 1
fib #3 is 2
fib #4 is 3
fib #5 is 5
fib #6 is 8
…
Tuples
© 2019 Cray Inc. 38
Base Language Features, by example
iter fib(n) {
var current = 0,
next = 1;
for i in 1..n {
yield current;
current += next;
current <=> next;
}
}
config const n = 10;
for (i,f) in zip(0..#n, fib(n)) do
writeln("fib #", i, " is ", f);
fib #0 is 0
fib #1 is 1
fib #2 is 1
fib #3 is 2
fib #4 is 3
fib #5 is 5
fib #6 is 8
…
© 2019 Cray Inc.
• Object-oriented programming (value- and reference-based)
• Managed objects and lifetime checking
• Nilable vs. non-nilable class variables
• Generic programming / polymorphism
• Error-handling
• Compile-time meta-programming
• Modules (supporting namespaces)
• Procedure overloading / filtering
• Arguments: default values, intents, name-based matching, type queries
• and more…
Other Base Language Features
39
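A small illustrative sketch (not from the talk) of a few of the features listed above: a generic procedure whose argument types are inferred per call, a default argument value, and name-based argument matching:

// 'scaled' is generic: the types of 'x' and 'factor' are inferred at each call site.
proc scaled(x, factor = 2) {
  return x * factor;
}

writeln(scaled(21));                // 42   (int arguments)
writeln(scaled(1.5));               // 3.0  (real argument, default factor)
writeln(scaled(10, factor = 0.5));  // 5.0  (name-based argument matching)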
© 2019 Cray Inc. 40
Task Parallelism and Locality Control
Domain Maps
Data Parallelism
Task Parallelism
Base Language
Target Machine
Locality Control
© 2019 Cray Inc.
• Locales can run tasks and store variables
• Think “compute node”
• Number of locales specified on execution command-line
> ./myProgram --numLocales=4 # or `-nl 4`
Locales, briefly
41
locale locale locale locale
Locales:
0 1 2 3
User’s main() executes on locale #0
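A minimal sketch (not from the talk) of querying these locales directly from a Chapel program:

// Visit each locale and report what it is.
writeln("running on ", numLocales, " locale(s)");
for loc in Locales do
  on loc do
    writeln("locale ", here.id, " is ", here.name,
            " with ", here.numPUs(), " processing units");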
© 2019 Cray Inc. 42
Task Parallelism and Locality, by example
taskParallel.chpl
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
prompt> chpl taskParallel.chpl
prompt> ./taskParallel
Hello from task 2 of 2 running on n1032
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc. 43
Task Parallelism and Locality, by example
taskParallel.chpl
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
Abstraction of
System Resources
prompt> chpl taskParallel.chpl
prompt> ./taskParallel
Hello from task 2 of 2 running on n1032
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc. 44
Task Parallelism and Locality, by example
taskParallel.chpl
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
High-Level
Task Parallelism
prompt> chpl taskParallel.chpl
prompt> ./taskParallel
Hello from task 2 of 2 running on n1032
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc. 45
Task Parallelism and Locality, by example
taskParallel.chpl
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
prompt> chpl taskParallel.chpl
prompt> ./taskParallel
Hello from task 2 of 2 running on n1032
Hello from task 1 of 2 running on n1032
So far, this is a shared memory program
Nothing refers to remote locales,
explicitly or implicitly
© 2019 Cray Inc. 46
Task Parallelism and Locality, by example
taskParallel.chpl
coforall loc in Locales do
on loc {
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
}
prompt> chpl taskParallel.chpl
prompt> ./taskParallel --numLocales=2
Hello from task 1 of 2 running on n1033
Hello from task 2 of 2 running on n1032
Hello from task 2 of 2 running on n1033
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc. 47
Task Parallelism and Locality, by example
taskParallel.chpl
coforall loc in Locales do
on loc {
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
}
prompt> chpl taskParallel.chpl
prompt> ./taskParallel --numLocales=2
Hello from task 1 of 2 running on n1033
Hello from task 2 of 2 running on n1032
Hello from task 2 of 2 running on n1033
Hello from task 1 of 2 running on n1032
Abstraction of
System Resources
© 2019 Cray Inc. 48
Task Parallelism and Locality, by example
taskParallel.chpl
coforall loc in Locales do
on loc {
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
}
prompt> chpl taskParallel.chpl
prompt> ./taskParallel --numLocales=2
Hello from task 1 of 2 running on n1033
Hello from task 2 of 2 running on n1032
Hello from task 2 of 2 running on n1033
Hello from task 1 of 2 running on n1032
High-Level
Task Parallelism
© 2019 Cray Inc. 49
Task Parallelism and Locality, by example
taskParallel.chpl
coforall loc in Locales do
on loc {
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
}
Control of Locality/Affinity
prompt> chpl taskParallel.chpl
prompt> ./taskParallel --numLocales=2
Hello from task 1 of 2 running on n1033
Hello from task 2 of 2 running on n1032
Hello from task 2 of 2 running on n1033
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc. 50
Task Parallelism and Locality, by example
taskParallel.chpl
coforall loc in Locales do
on loc {
const numTasks = here.numPUs();
coforall tid in 1..numTasks do
writef("Hello from task %n of %n "+
"running on %sn",
tid, numTasks, here.name);
}
prompt> chpl taskParallel.chpl
prompt> ./taskParallel --numLocales=2
Hello from task 1 of 2 running on n1033
Hello from task 2 of 2 running on n1032
Hello from task 2 of 2 running on n1033
Hello from task 1 of 2 running on n1032
© 2019 Cray Inc.
• atomic / synchronized variables: for sharing data & coordination
• begin / cobegin statements: other ways of creating tasks
Other Task Parallel Features
51
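An illustrative sketch (not from the talk) of these features working together: an atomic counter, a sync variable for coordination, an unstructured begin task, and a structured cobegin:

var count: atomic int;  // atomic variable: safe concurrent updates
var flag: sync bool;    // sync variable: full/empty coordination

begin {                 // 'begin' launches an unstructured, fire-and-forget task
  count.add(1);
  flag.writeEF(true);   // fill the sync variable when the task is done
}

cobegin {               // 'cobegin' runs each statement as a task, then joins them
  count.add(10);
  count.add(100);
}

flag.readFE();          // block until the 'begin' task has completed
writeln(count.read());  // 111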
© 2019 Cray Inc.
Task Parallelism
Base Language
Target Machine
Locality Control
Chapel language concepts
52
Data Parallelism in Chapel
Higher-level
Chapel
Domain Maps
Data Parallelism
© 2019 Cray Inc. 53
Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
config const n = 1000;
var D = {1..n, 1..n};
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
© 2019 Cray Inc. 54
Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
config const n = 1000;
var D = {1..n, 1..n};
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
Domains (Index Sets)
© 2019 Cray Inc. 55
Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
config const n = 1000;
var D = {1..n, 1..n};
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
Arrays
© 2019 Cray Inc. 56
Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
config const n = 1000;
var D = {1..n, 1..n};
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
Data-Parallel Forall Loops
© 2019 Cray Inc. 57
Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
config const n = 1000;
var D = {1..n, 1..n};
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
So far, this is a shared memory program
Nothing refers to remote locales,
explicitly or implicitly
© 2019 Cray Inc. 58
Distributed Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5 --numLocales=4
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
use CyclicDist;
config const n = 1000;
var D = {1..n, 1..n}
dmapped Cyclic(startIdx = (1,1));
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
Domain Maps
(Map Data Parallelism to the System)
© 2019 Cray Inc. 59
Distributed Data Parallelism, by example
prompt> chpl dataParallel.chpl
prompt> ./dataParallel --n=5 --numLocales=4
1.1 1.3 1.5 1.7 1.9
2.1 2.3 2.5 2.7 2.9
3.1 3.3 3.5 3.7 3.9
4.1 4.3 4.5 4.7 4.9
5.1 5.3 5.5 5.7 5.9
use CyclicDist;
config const n = 1000;
var D = {1..n, 1..n}
dmapped Cyclic(startIdx = (1,1));
var A: [D] real;
forall (i,j) in D do
A[i,j] = i + (j - 0.5)/n;
writeln(A);
dataParallel.chpl
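To emphasize that only the domain map changes, here is a hedged variant of dataParallel.chpl using a Block distribution in place of Cyclic (assuming the standard BlockDist module); the array declaration and forall loop are untouched:

use BlockDist;

config const n = 1000;

// Only the 'dmapped' clause differs from the Cyclic version above.
var D = {1..n, 1..n}
        dmapped Block(boundingBox = {1..n, 1..n});
var A: [D] real;

forall (i,j) in D do
  A[i,j] = i + (j - 0.5)/n;

writeln(A);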
© 2019 Cray Inc.
• Parallel Iterators and Zippering
• Slicing: refer to subarrays using ranges / domains
• Promotion: execute scalar functions in parallel using array arguments
• Reductions: collapse arrays to scalars or subarrays
• Scans: parallel prefix operations
• Several Domain/Array Types
Other Data Parallel Features
60
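A small sketch (not from the talk) showing several of these features on a one-dimensional array:

var A: [1..8] real;
forall i in A.domain do  // data-parallel initialization
  A[i] = i;

writeln(A[3..5]);        // slicing: 3.0 4.0 5.0
writeln(sqrt(A));        // promotion: element-wise square roots, in parallel
writeln(+ reduce A);     // reduction: 36.0
writeln(+ scan A);       // scan (parallel prefix sum): 1.0 3.0 6.0 ... 36.0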
© 2019 Cray Inc.
STREAM Triad and HPCC RA Kernel, revisited
61
use …;
config const m = 1000,
alpha = 3.0;
const ProblemSpace = {1..m} dmapped …;
var A, B, C: [ProblemSpace] real;
B = 2.0;
C = 1.0;
A = B + alpha * C;
forall (_, r) in zip(Updates, RAStream()) do
T[r & indexMask].xor(r);
© 2019 Cray Inc.
Arkouda:
NumPy over Chapel
62
© 2019 Cray Inc.
Motivation: Say you’ve got…
…a bunch of Python programmers
…HPC-scale problems to solve
…access to HPC systems
How should you leverage these Python programmers to get your work done?
Concept: Develop Python libraries that are implemented in Chapel
⇒ get performance, as with Python-over-C, but also parallelism + scalability
Even Better: use NumPy interfaces to make it trivial / familiar for users
Arkouda: Context
63
© 2019 Cray Inc.
Sample use from Jupyter
64
© 2019 Cray Inc.
By taking this approach, this user was able to:
• interact with a distributed, running Chapel program from Python within Jupyter
• run the same back-end program on…
…a Cray XC
…an HPE Superdome X
…an Infiniband cluster
…a Mac laptop
• compute on TB-sized arrays in seconds
• with ~1 month of effort
Arkouda Accomplishments
65
© 2019 Cray Inc.
• High level — makes for less code
• Great support for array operations and distributed arrays
• Direct support for synchronized/atomic variables
• Close to “Pythonic” (for a statically typed language)
• Provides a gateway for data scientists ready to go beyond Python
• Portability and Scalability
• Same code runs on shared- or distributed-memory systems
• “From Raspberry Pi to Cray XC”
• Integrates with (distributed) numerical libraries (e.g., FFTW, FFTW-MPI)
“Why Chapel?”
66
© 2019 Cray Inc.
Summary and
Resources
67
© 2019 Cray Inc.
Chapel cleanly and orthogonally supports…
…expression of parallelism and locality
…specifying how to map computations to the system
Chapel is powerful:
• supports succinct, straightforward code
• can result in performance that competes with (or beats) C+MPI+OpenMP
Chapel is attractive to Python programmers
• as a native language: similarly readable / writeable, yet scalable
• as an implementation option for Python libraries
Summary of this Talk
68
© 2019 Cray Inc.
Chapel Central
69
https://chapel-lang.org
● downloads
● presentations
● papers
● resources
● documentation
© 2019 Cray Inc.
https://chapel-lang.org/docs: ~200 pages, including primer examples
Chapel Online Documentation
70
© 2019 Cray Inc.
read-only mailing list: chapel-announce@lists.sourceforge.net (~15 mails / year)
Chapel Community
71
https://stackoverflow.com/questions/tagged/chapel
https://github.com/chapel-lang/chapel/issues
https://gitter.im/chapel-lang/chapel
© 2019 Cray Inc.
Chapel Social Media (no account required)
72
http://twitter.com/ChapelLanguage
http://facebook.com/ChapelLanguage
https://www.youtube.com/channel/UCHmm27bYjhknK5mU7ZzPGsQ/
© 2019 Cray Inc.
Chapel chapter from Programming Models for Parallel Computing
• a detailed overview of Chapel’s history, motivating themes, features
• published by MIT Press, November 2015
• edited by Pavan Balaji (Argonne)
• chapter is also available online
Suggested Reading: Chapel history and overview
73
© 2019 Cray Inc.
Suggested Reading: Recent Progress (CUG 2018)
74
paper and slides available at chapel-lang.org
© 2019 Cray Inc.
The 6th Annual Chapel Implementers and Users Workshop
• sponsored by ACM SIGPLAN
• held in conjunction with PLDI 2019, FCRC 2019
• June 22-23, Phoenix AZ
Keynote: Anshu Dubey (Argonne)
Programming Abstractions for Orchestration of HPC Scientific Computing
CHIUW 2019 (https://chapel-lang.org/CHIUW2019.html)
75
© 2019 Cray Inc.
Community Talks:
• Arkouda
• hybrid CPU-GPU computations with Chapel
• Stencil computations: getting performance for a student exercise
• Chapel Graph Library
Cray talks:
• Chapel’s use in AI / HPO
• interoperability improvements for Python, C, and Fortran
• recent performance optimizations
• radix sort in Chapel
Also: state of the project, lightning talks, coding day
CHIUW 2019 (https://chapel-lang.org/CHIUW2019.html)
76
© 2019 Cray Inc.
Chapel cleanly and orthogonally supports…
…expression of parallelism and locality
…specifying how to map computations to the system
Chapel is powerful:
• supports succinct, straightforward code
• can result in performance that competes with (or beats) C+MPI+OpenMP
Chapel is attractive to Python programmers
• as a native language: similarly readable / writeable, yet scalable
• as an implementation option for Python libraries
Summary of this Talk
77
© 2019 Cray Inc.
SAFE HARBOR STATEMENT
This presentation may contain forward-looking
statements that are based on our current
expectations. Forward looking statements may
include statements about our financial
guidance and expected operating results, our
opportunities and future potential, our product
development and new product introduction
plans, our ability to expand and penetrate our
addressable markets and other statements
that are not historical facts.
These statements are only predictions and
actual results may materially vary from those
projected. Please refer to Cray's documents
filed with the SEC from time to time
concerning factors that could affect the
Company and these forward-looking
statements.
78
THANK YOU
QUESTIONS?
@ChapelLanguage
chapel-lang.org
bradc@cray.com
@cray_inc
linkedin.com/company/cray-inc-/
cray.com
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 

Recently uploaded (20)

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 

Chapel Comes of Age: a Language for Productivity, Parallelism, and Performance

  • 1. © 2019 Cray Inc. chapel-lang.org Chapel Comes of Age: A Language for Productivity, Parallelism, and Performance Brad Chamberlain HPC Knowledge Meeting `19 June 14, 2019 @ChapelLanguage bradc@cray.com
  • 2. © 2019 Cray Inc. Education: • Earned Ph.D. from University of Washington CSE in 2001 • focused on the ZPL data-parallel array language • Remain associated with UW CSE as an Affiliate Professor Industry: • A Principal Engineer at Cray Inc. • The technical lead / a founding member of the Chapel project Who am I? 2
  • 3. © 2019 Cray Inc. What is Chapel? Chapel: A modern parallel programming language • portable & scalable • open-source & collaborative Goals: • Support general parallel programming • “any parallel algorithm on any parallel hardware” • Make parallel programming at scale far more productive 3
  • 4. © 2019 Cray Inc. What does “Productivity” mean to you? 4 Recent Graduates: “something similar to what I used in school: Python, Matlab, Java, …” Seasoned HPC Programmers: “that sugary stuff that I don’t need because I was born to suffer / because I want full control to ensure performance” Computational Scientists: “something that lets me express my parallel computations without having to wrestle with architecture-specific details” Chapel Team: “something that lets computational scientists express what they want, without taking away the control that HPC programmers want, implemented in a language as attractive as recent graduates want.”
  • 5. © 2019 Cray Inc. 5 [Source: Kathy Yelick, CHIUW 2018 keynote: Why Languages Matter More Than Ever]
  • 6. © 2019 Cray Inc. ✓ Context and Motivation Ø Chapel and Productivity • A Brief Tour of Chapel Features • Arkouda: NumPy over Chapel • Summary and Resources Outline
  • 7. © 2019 Cray Inc. Chapel aims to be as… …programmable as Python …fast as Fortran …scalable as MPI, SHMEM, or UPC …portable as C …flexible as C++ …fun as [your favorite programming language] Comparing Chapel to Other Languages 7
  • 8. © 2019 Cray Inc. Given: m-element vectors A, B, C Compute: ∀i ∈ 1..m, A_i = B_i + α⋅C_i In pictures: STREAM Triad: a trivial parallel computation 8 [figure: the elementwise computation A = B + α·C]
  • 9. © 2019 Cray Inc. Given: m-element vectors A, B, C Compute: ∀i ∈ 1..m, A_i = B_i + α⋅C_i In pictures, in parallel (shared memory / multicore): STREAM Triad: a trivial parallel computation 9 [figure: the elementwise computation split across the cores of a single node]
  • 10. © 2019 Cray Inc. Given: m-element vectors A, B, C Compute: ∀i ∈ 1..m, A_i = B_i + α⋅C_i In pictures, in parallel (distributed memory): STREAM Triad: a trivial parallel computation 10 [figure: the elementwise computation split across the nodes of a distributed-memory system]
  • 11. © 2019 Cray Inc. Given: m-element vectors A, B, C Compute: ∀i ∈ 1..m, A_i = B_i + α⋅C_i In pictures, in parallel (distributed memory multicore): STREAM Triad: a trivial parallel computation 11 [figure: the elementwise computation split across nodes and, within each node, across cores]
  • 12. © 2019 Cray Inc. #include <hpcc.h> #ifdef _OPENMP #include <omp.h> #endif static int VectorSize; static double *a, *b, *c; int HPCC_StarStream(HPCC_Params *params) { int myRank, commSize; int rv, errCount; MPI_Comm comm = MPI_COMM_WORLD; MPI_Comm_size( comm, &commSize ); MPI_Comm_rank( comm, &myRank ); rv = HPCC_Stream( params, 0 == myRank); MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm ); return errCount; } int HPCC_Stream(HPCC_Params *params, int doIO) { register int j; double scalar; VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 ); a = HPCC_XMALLOC( double, VectorSize ); b = HPCC_XMALLOC( double, VectorSize ); c = HPCC_XMALLOC( double, VectorSize ); if (!a || !b || !c) { if (c) HPCC_free(c); if (b) HPCC_free(b); if (a) HPCC_free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).n", VectorSize ); fclose( outFile ); } return 1; } #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) { b[j] = 2.0; c[j] = 1.0; } scalar = 3.0; #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) a[j] = b[j]+scalar*c[j]; HPCC_free(c); HPCC_free(b); HPCC_free(a); return 0; } STREAM Triad: C + MPI 12
  • 13. © 2019 Cray Inc. #include <hpcc.h> #ifdef _OPENMP #include <omp.h> #endif static int VectorSize; static double *a, *b, *c; int HPCC_StarStream(HPCC_Params *params) { int myRank, commSize; int rv, errCount; MPI_Comm comm = MPI_COMM_WORLD; MPI_Comm_size( comm, &commSize ); MPI_Comm_rank( comm, &myRank ); rv = HPCC_Stream( params, 0 == myRank); MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm ); return errCount; } int HPCC_Stream(HPCC_Params *params, int doIO) { register int j; double scalar; VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 ); a = HPCC_XMALLOC( double, VectorSize ); b = HPCC_XMALLOC( double, VectorSize ); c = HPCC_XMALLOC( double, VectorSize ); if (!a || !b || !c) { if (c) HPCC_free(c); if (b) HPCC_free(b); if (a) HPCC_free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).n", VectorSize ); fclose( outFile ); } return 1; } #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) { b[j] = 2.0; c[j] = 1.0; } scalar = 3.0; #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) a[j] = b[j]+scalar*c[j]; HPCC_free(c); HPCC_free(b); HPCC_free(a); return 0; } STREAM Triad: C + MPI + OpenMP 13
  • 14. © 2019 Cray Inc. #include <hpcc.h> #ifdef _OPENMP #include <omp.h> #endif static int VectorSize; static double *a, *b, *c; int HPCC_StarStream(HPCC_Params *params) { int myRank, commSize; int rv, errCount; MPI_Comm comm = MPI_COMM_WORLD; MPI_Comm_size( comm, &commSize ); MPI_Comm_rank( comm, &myRank ); rv = HPCC_Stream( params, 0 == myRank); MPI_Reduce( &rv, &errCount, 1, MPI_INT, MPI_SUM, 0, comm ); return errCount; } int HPCC_Stream(HPCC_Params *params, int doIO) { register int j; double scalar; VectorSize = HPCC_LocalVectorSize( params, 3, sizeof(double), 0 ); a = HPCC_XMALLOC( double, VectorSize ); b = HPCC_XMALLOC( double, VectorSize ); c = HPCC_XMALLOC( double, VectorSize ); if (!a || !b || !c) { if (c) HPCC_free(c); if (b) HPCC_free(b); if (a) HPCC_free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).n", VectorSize ); fclose( outFile ); } return 1; } #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) { b[j] = 2.0; c[j] = 1.0; } scalar = 3.0; #ifdef _OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) a[j] = b[j]+scalar*c[j]; HPCC_free(c); HPCC_free(b); HPCC_free(a); return 0; } STREAM Triad: Chapel 14 use …; config const m = 1000, alpha = 3.0; const ProblemSpace = {1..m} dmapped …; var A, B, C: [ProblemSpace] real; B = 2.0; C = 1.0; A = B + alpha * C; The special sauce: How should this index set—and the arrays and computations over it—be mapped to the system?
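Editor's note: for concreteness, here is one possible way to fill in the elided "use …" and "dmapped …" lines of the Chapel STREAM Triad above, as a sketch assuming the standard BlockDist module and a Block distribution; the slide deliberately leaves the choice of domain map open, so treat the distribution below as an illustrative assumption rather than the slide's answer.

  use BlockDist;

  config const m = 1000,
               alpha = 3.0;

  // assumption: spread the index set blockwise across the locales
  const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m});

  var A, B, C: [ProblemSpace] real;

  B = 2.0;
  C = 1.0;

  A = B + alpha * C;   // whole-array, distributed, parallel triad

Swapping in a different domain map (e.g., a cyclic one) changes how the arrays and the computation are mapped to the system without touching the rest of the code, which is the point the slide's "special sauce" annotation makes.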
  • 15. © 2019 Cray Inc. HPCC STREAM Triad: Chapel vs. C+MPI+OpenMP 15 [performance chart; higher is better]
  • 16. © 2019 Cray Inc. Data Structure: distributed table Computation: update random table locations in parallel Two variations: • lossless: don’t allow any updates to be lost • lossy: permit some fraction of updates to be lost HPCC Random Access (RA) 16
  • 17. © 2019 Cray Inc. Data Structure: distributed table Computation: update random table locations in parallel Two variations: • lossless: don’t allow any updates to be lost • lossy: permit some fraction of updates to be lost HPCC Random Access (RA) 17
  • 18. © 2019 Cray Inc. HPCC RA: MPI kernel 18 /* Perform updates to main table. The scalar equivalent is: * * for (i=0; i<NUPDATE; i++) { * Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0); * Table[Ran & (TABSIZE-1)] ^= Ran; * } */ MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); while (i < SendCnt) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { NumberReceiving--; } else MPI_Abort( MPI_COMM_WORLD, -1 ); MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); if (pendingUpdates < maxPendingUpdates) { Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B); GlobalOffset = Ran & (tparams.TableSize-1); if ( GlobalOffset < tparams.Top) WhichPe = ( GlobalOffset / (tparams.MinLocalTableSize + 1) ); else WhichPe = ( (GlobalOffset - tparams.Remainder) / tparams.MinLocalTableSize ); if (WhichPe == tparams.MyProc) { LocalOffset = (Ran & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= Ran; } else { HPCC_InsertUpdate(Ran, WhichPe, Buckets); pendingUpdates++; } i++; } else { MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } } /* send remaining updates in buckets */ while (pendingUpdates > 0) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } /* send our done messages */ for (proc_count = 0 ; proc_count < tparams.NumProcs ; ++proc_count) { if (proc_count == tparams.MyProc) { tparams.finish_req[tparams.MyProc] = MPI_REQUEST_NULL; continue; } /* send garbage - who cares, no one will look at it */ MPI_Isend(&Ran, 0, tparams.dtype64, proc_count, FINISHED_TAG, MPI_COMM_WORLD, tparams.finish_req + proc_count); } /* Finish everyone else up... 
*/ while (NumberReceiving > 0) { MPI_Wait(&inreq, &status); if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } MPI_Waitall( tparams.NumProcs, tparams.finish_req, tparams.finish_statuses);
  • 19. © 2019 Cray Inc. HPCC RA: MPI kernel comment vs. Chapel 19 /* Perform updates to main table. The scalar equivalent is: * * for (i=0; i<NUPDATE; i++) { * Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0); * Table[Ran & (TABSIZE-1)] ^= Ran; * } */ MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); while (i < SendCnt) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { NumberReceiving--; } else MPI_Abort( MPI_COMM_WORLD, -1 ); MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); if (pendingUpdates < maxPendingUpdates) { Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B); GlobalOffset = Ran & (tparams.TableSize-1); if ( GlobalOffset < tparams.Top) WhichPe = ( GlobalOffset / (tparams.MinLocalTableSize + 1) ); else WhichPe = ( (GlobalOffset - tparams.Remainder) / tparams.MinLocalTableSize ); if (WhichPe == tparams.MyProc) { LocalOffset = (Ran & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= Ran; } else { HPCC_InsertUpdate(Ran, WhichPe, Buckets); pendingUpdates++; } i++; } else { MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } } /* send remaining updates in buckets */ while (pendingUpdates > 0) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } /* send our done messages */ for (proc_count = 0 ; proc_count < tparams.NumProcs ; ++proc_count) { if (proc_count == tparams.MyProc) { tparams.finish_req[tparams.MyProc] = MPI_REQUEST_NULL; continue; } /* send garbage - who cares, no one will look at it */ MPI_Isend(&Ran, 0, tparams.dtype64, proc_count, FINISHED_TAG, MPI_COMM_WORLD, tparams.finish_req + proc_count); } /* Finish everyone else up... 
*/ while (NumberReceiving > 0) { MPI_Wait(&inreq, &status); if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } MPI_Waitall( tparams.NumProcs, tparams.finish_req, tparams.finish_statuses); /* Perform updates to main table. The scalar equivalent is: * * for (i=0; i<NUPDATE; i++) { * Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0); * Table[Ran & (TABSIZE-1)] ^= Ran; * } */ MPI Comment forall (_, r) in zip(Updates, RAStream()) do T[r & indexMask].xor(r); Chapel Kernel
  • 20. © 2019 Cray Inc. HPCC RA: Chapel vs. C+MPI 20 [chart: RA Performance (GUPS) vs. Locales (x 36 cores / locale) at 32-256 locales, comparing Chapel 1.19 against MPI (bucketing); higher is better]
  • 21. © 2019 Cray Inc. HPCC RA: MPI vs. Chapel 21 /* Perform updates to main table. The scalar equivalent is: * * for (i=0; i<NUPDATE; i++) { * Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0); * Table[Ran & (TABSIZE-1)] ^= Ran; * } */ MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); while (i < SendCnt) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { NumberReceiving--; } else MPI_Abort( MPI_COMM_WORLD, -1 ); MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); if (pendingUpdates < maxPendingUpdates) { Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B); GlobalOffset = Ran & (tparams.TableSize-1); if ( GlobalOffset < tparams.Top) WhichPe = ( GlobalOffset / (tparams.MinLocalTableSize + 1) ); else WhichPe = ( (GlobalOffset - tparams.Remainder) / tparams.MinLocalTableSize ); if (WhichPe == tparams.MyProc) { LocalOffset = (Ran & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= Ran; } else { HPCC_InsertUpdate(Ran, WhichPe, Buckets); pendingUpdates++; } i++; } else { MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } } /* send remaining updates in buckets */ while (pendingUpdates > 0) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } /* send our done messages */ for (proc_count = 0 ; proc_count < tparams.NumProcs ; ++proc_count) { if (proc_count == tparams.MyProc) { tparams.finish_req[tparams.MyProc] = MPI_REQUEST_NULL; continue; } /* send garbage - who cares, no one will look at it */ MPI_Isend(&Ran, 0, tparams.dtype64, proc_count, FINISHED_TAG, MPI_COMM_WORLD, tparams.finish_req + proc_count); } /* Finish everyone else up... 
*/ while (NumberReceiving > 0) { MPI_Wait(&inreq, &status); if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } MPI_Waitall( tparams.NumProcs, tparams.finish_req, tparams.finish_statuses); forall (_, r) in zip(Updates, RAStream()) do T[r & indexMask].xor(r); Chapel Kernel
  • 22. © 2019 Cray Inc. HPCC RA: MPI vs. Chapel 22 /* Perform updates to main table. The scalar equivalent is: * * for (i=0; i<NUPDATE; i++) { * Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0); * Table[Ran & (TABSIZE-1)] ^= Ran; * } */ MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); while (i < SendCnt) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { NumberReceiving--; } else MPI_Abort( MPI_COMM_WORLD, -1 ); MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); if (pendingUpdates < maxPendingUpdates) { Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B); GlobalOffset = Ran & (tparams.TableSize-1); if ( GlobalOffset < tparams.Top) WhichPe = ( GlobalOffset / (tparams.MinLocalTableSize + 1) ); else WhichPe = ( (GlobalOffset - tparams.Remainder) / tparams.MinLocalTableSize ); if (WhichPe == tparams.MyProc) { LocalOffset = (Ran & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= Ran; } else { HPCC_InsertUpdate(Ran, WhichPe, Buckets); pendingUpdates++; } i++; } else { MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } } /* send remaining updates in buckets */ while (pendingUpdates > 0) { /* receive messages */ do { MPI_Test(&inreq, &have_done, &status); if (have_done) { if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } } while (have_done && NumberReceiving > 0); MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE); if (have_done) { outreq = MPI_REQUEST_NULL; pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates); MPI_Isend(&LocalSendBuffer, peUpdates, tparams.dtype64, (int)pe, UPDATE_TAG, MPI_COMM_WORLD, &outreq); pendingUpdates -= peUpdates; } } /* send our done messages */ for (proc_count = 0 ; proc_count < tparams.NumProcs ; ++proc_count) { if (proc_count == tparams.MyProc) { tparams.finish_req[tparams.MyProc] = MPI_REQUEST_NULL; continue; } /* send garbage - who cares, no one will look at it */ MPI_Isend(&Ran, 0, tparams.dtype64, proc_count, FINISHED_TAG, MPI_COMM_WORLD, tparams.finish_req + proc_count); } /* Finish everyone else up... 
*/ while (NumberReceiving > 0) { MPI_Wait(&inreq, &status); if (status.MPI_TAG == UPDATE_TAG) { MPI_Get_count(&status, tparams.dtype64, &recvUpdates); bufferBase = 0; for (j=0; j < recvUpdates; j ++) { inmsg = LocalRecvBuffer[bufferBase+j]; LocalOffset = (inmsg & (tparams.TableSize - 1)) – tparams.GlobalStartMyProc; HPCC_Table[LocalOffset] ^= inmsg; } } else if (status.MPI_TAG == FINISHED_TAG) { /* we got a done message. Thanks for playing... */ NumberReceiving--; } else { MPI_Abort( MPI_COMM_WORLD, -1 ); } MPI_Irecv(&LocalRecvBuffer, localBufferSize, tparams.dtype64, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq); } MPI_Waitall( tparams.NumProcs, tparams.finish_req, tparams.finish_statuses); forall (_, r) in zip(Updates, RAStream()) do T[r & indexMask].xor(r); Chapel Kernel
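Editor's note: for context around the one-line Chapel kernel shown on the RA slides above, the following is a minimal, hypothetical sketch of the kind of declarations such a kernel relies on. The table, update set, and random-value generator here are illustrative stand-ins chosen by the editor; they are not the actual HPCC RA harness, its parameters, or its RAStream() generator.

  use BlockDist;

  config const lgTableSize = 20;                   // assumed table size: 2**20 entries
  const tableSize = 1 << lgTableSize;
  const indexMask = (tableSize - 1): uint;

  const TableSpace = {0..#tableSize} dmapped Block(boundingBox={0..#tableSize});
  var T: [TableSpace] atomic uint;                 // the distributed table

  const Updates = {0..#(4*tableSize)} dmapped Block(boundingBox={0..#(4*tableSize)});

  // stand-in for the benchmark's random stream: one pseudo-random value per update
  proc pseudoRandom(i: int): uint {
    var x = (i: uint) * 0x9E3779B97F4A7C15;        // simple hash, not HPCC's generator
    x ^= x >> 31;
    return x;
  }

  forall u in Updates {
    const r = pseudoRandom(u);
    T[(r & indexMask): int].xor(r);                // remote atomic update, as in the kernel above
  }

The essential property matches the slide: each update is a single atomic xor on a distributed table, and the compiler/runtime handle the communication that the MPI version spells out by hand.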
  • 23. © 2019 Cray Inc. 23 [Source: Kathy Yelick, CHIUW 2018 keynote: Why Languages Matter More Than Ever]
  • 24. © 2019 Cray Inc. 24 HPC Patterns: Chapel vs. Reference [montage of performance charts: LCALS serial and parallel kernels (local loop kernels, normalized to reference), HPCC STREAM Triad (embarrassing/pleasing parallelism), PRK Stencil (stencil boundary exchanges), HPCC RA (global random updates), and ISx (bucket-exchange pattern), each comparing Chapel against its reference C / C+MPI+OpenMP implementation; higher/faster is better] Nightly performance tickers online at: https://chapel-lang.org/perf-nightly.html
  • 25. © 2019 Cray Inc. 25 HPC Patterns: Chapel vs. Reference [the same montage of performance charts as the previous slide] Nightly performance tickers online at: https://chapel-lang.org/perf-nightly.html
  • 26. © 2019 Cray Inc. A Brief Tour of Chapel Features 26
  • 27. © 2019 Cray Inc. Domain Maps Data Parallelism Task Parallelism Base Language Target Machine Locality Control Chapel language concepts 27 Chapel Feature Areas
  • 28. © 2019 Cray Inc. 28 Base Language Domain Maps Data Parallelism Task Parallelism Base Language Target Machine Locality Control Lower-level Chapel
  • 29. © 2019 Cray Inc. 29 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for f in fib(n) do writeln(f); 0 1 1 2 3 5 8 …
  • 30. © 2019 Cray Inc. 30 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for f in fib(n) do writeln(f); 0 1 1 2 3 5 8 … Configuration declarations (support command-line overrides) ./fib –-n=1000000
  • 31. © 2019 Cray Inc. 31 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for f in fib(n) do writeln(f); 0 1 1 2 3 5 8 … CLU-style iterators Modern iterators Iterators
  • 32. © 2019 Cray Inc. 32 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for f in fib(n) do writeln(f); 0 1 1 2 3 5 8 … Static type inference for: • arguments • return types • variables
  • 33. © 2019 Cray Inc. 33 Base Language Features, by example 0 1 1 2 3 5 8 … iter fib(n: int): int { var current: int = 0, next: int = 1; for i in 1..n { yield current; current += next; current <=> next; } } Static type inference for: • arguments • return types • variables config const n: int = 10; for f in fib(n) do writeln(f); Explicit types also supported
  • 34. © 2019 Cray Inc. 34 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for f in fib(n) do writeln(f); 0 1 1 2 3 5 8 …
  • 35. © 2019 Cray Inc. 35 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for (i,f) in zip(0..#n, fib(n)) do writeln("fib #", i, " is ", f); fib #0 is 0 fib #1 is 1 fib #2 is 1 fib #3 is 2 fib #4 is 3 fib #5 is 5 fib #6 is 8 … Zippered iteration
  • 36. © 2019 Cray Inc. 36 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for (i,f) in zip(0..#n, fib(n)) do writeln("fib #", i, " is ", f); fib #0 is 0 fib #1 is 1 fib #2 is 1 fib #3 is 2 fib #4 is 3 fib #5 is 5 fib #6 is 8 … range types and operators Range types and operators
  • 37. © 2019 Cray Inc. 37 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for (i,f) in zip(0..#n, fib(n)) do writeln("fib #", i, " is ", f); fib #0 is 0 fib #1 is 1 fib #2 is 1 fib #3 is 2 fib #4 is 3 fib #5 is 5 fib #6 is 8 … Tuples
  • 38. © 2019 Cray Inc. 38 Base Language Features, by example iter fib(n) { var current = 0, next = 1; for i in 1..n { yield current; current += next; current <=> next; } } config const n = 10; for (i,f) in zip(0..#n, fib(n)) do writeln("fib #", i, " is ", f); fib #0 is 0 fib #1 is 1 fib #2 is 1 fib #3 is 2 fib #4 is 3 fib #5 is 5 fib #6 is 8 …
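Editor's note: reassembled from the flattened slide text above, the complete base-language example (zippered-iteration version) is a short, runnable program:

  iter fib(n) {
    var current = 0,
        next = 1;
    for i in 1..n {
      yield current;
      current += next;
      current <=> next;    // swap operator
    }
  }

  config const n = 10;     // overridable on the command line, e.g. ./fib --n=20

  for (i, f) in zip(0..#n, fib(n)) do
    writeln("fib #", i, " is ", f);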
  • 39. © 2019 Cray Inc. • Object-oriented programming (value- and reference-based) • Managed objects and lifetime checking • Nilable vs. non-nilable class variables • Generic programming / polymorphism • Error-handling • Compile-time meta-programming • Modules (supporting namespaces) • Procedure overloading / filtering • Arguments: default values, intents, name-based matching, type queries • and more… Other Base Language Features 39
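Editor's note: as a small, hypothetical illustration of a few of the features in the list above (records, managed classes, and generic procedures); the types and names below are the editor's, not taken from the slides.

  record Point {
    var x, y: real;
  }

  class Particle {
    var pos: Point;
    var mass = 1.0;
  }

  // generic procedure: argument and return types are inferred per call site
  proc addEm(a, b) {
    return a + b;
  }

  var p = new owned Particle(new Point(1.0, 2.0), 3.5);  // lifetime managed by 'owned'
  writeln(p.pos, " ", p.mass);
  writeln(addEm(1, 2), " ", addEm(1.5, 2.5));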
  • 40. © 2019 Cray Inc. 40 Task Parallelism and Locality Control Domain Maps Data Parallelism Task Parallelism Base Language Target Machine Locality Control
  • 41. © 2019 Cray Inc. • Locales can run tasks and store variables • Think “compute node” • Number of locales specified on execution command-line > ./myProgram --numLocales=4 # or `–nl 4` Locales, briefly 41 locale locale locale locale Locales: 0 1 2 3 User’s main() executes on locale #0
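Editor's note: a minimal sketch of what these locale concepts look like in code, assuming a multi-locale build of Chapel; not taken from the slides.

  writeln("running on ", numLocales, " locale(s)");

  for loc in Locales do
    on loc do                                  // migrate execution to that locale
      writeln("locale ", here.id, ": ", here.name,
              " has ", here.numPUs(), " processing units");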
  • 42. © 2019 Cray Inc. 42 Task Parallelism and Locality, by example taskParallel.chpl const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); prompt> chpl taskParallel.chpl prompt> ./taskParallel Hello from task 2 of 2 running on n1032 Hello from task 1 of 2 running on n1032
  • 43. © 2019 Cray Inc. 43 Task Parallelism and Locality, by example taskParallel.chpl const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); Abstraction of System Resources prompt> chpl taskParallel.chpl prompt> ./taskParallel Hello from task 2 of 2 running on n1032 Hello from task 1 of 2 running on n1032
  • 44. © 2019 Cray Inc. 44 Task Parallelism and Locality, by example taskParallel.chpl const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); High-Level Task Parallelism prompt> chpl taskParallel.chpl prompt> ./taskParallel Hello from task 2 of 2 running on n1032 Hello from task 1 of 2 running on n1032
  • 45. © 2019 Cray Inc. 45 Task Parallelism and Locality, by example taskParallel.chpl const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); prompt> chpl taskParallel.chpl prompt> ./taskParallel Hello from task 2 of 2 running on n1032 Hello from task 1 of 2 running on n1032 So far, this is a shared memory program Nothing refers to remote locales, explicitly or implicitly
  • 46. © 2019 Cray Inc. 46 Task Parallelism and Locality, by example taskParallel.chpl coforall loc in Locales do on loc { const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); } prompt> chpl taskParallel.chpl prompt> ./taskParallel --numLocales=2 Hello from task 1 of 2 running on n1033 Hello from task 2 of 2 running on n1032 Hello from task 2 of 2 running on n1033 Hello from task 1 of 2 running on n1032
  • 47. © 2019 Cray Inc. 47 Task Parallelism and Locality, by example taskParallel.chpl coforall loc in Locales do on loc { const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); } prompt> chpl taskParallel.chpl prompt> ./taskParallel --numLocales=2 Hello from task 1 of 2 running on n1033 Hello from task 2 of 2 running on n1032 Hello from task 2 of 2 running on n1033 Hello from task 1 of 2 running on n1032 Abstraction of System Resources
  • 48. © 2019 Cray Inc. 48 Task Parallelism and Locality, by example taskParallel.chpl coforall loc in Locales do on loc { const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); } prompt> chpl taskParallel.chpl prompt> ./taskParallel --numLocales=2 Hello from task 1 of 2 running on n1033 Hello from task 2 of 2 running on n1032 Hello from task 2 of 2 running on n1033 Hello from task 1 of 2 running on n1032 High-Level Task Parallelism
  • 49. © 2019 Cray Inc. 49 Task Parallelism and Locality, by example taskParallel.chpl coforall loc in Locales do on loc { const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); } Control of Locality/Affinity prompt> chpl taskParallel.chpl prompt> ./taskParallel --numLocales=2 Hello from task 1 of 2 running on n1033 Hello from task 2 of 2 running on n1032 Hello from task 2 of 2 running on n1033 Hello from task 1 of 2 running on n1032
  • 50. © 2019 Cray Inc. 50 Task Parallelism and Locality, by example taskParallel.chpl coforall loc in Locales do on loc { const numTasks = here.numPUs(); coforall tid in 1..numTasks do writef("Hello from task %n of %n "+ "running on %s\n", tid, numTasks, here.name); } prompt> chpl taskParallel.chpl prompt> ./taskParallel --numLocales=2 Hello from task 1 of 2 running on n1033 Hello from task 2 of 2 running on n1032 Hello from task 2 of 2 running on n1033 Hello from task 1 of 2 running on n1032
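Editor's note: for readability, here is the multi-locale version of the program from the preceding slides, reassembled from the flattened slide text into runnable form:

  coforall loc in Locales do on loc {          // one task per locale, each running there
    const numTasks = here.numPUs();
    coforall tid in 1..numTasks do             // one task per processing unit
      writef("Hello from task %n of %n running on %s\n",
             tid, numTasks, here.name);
  }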
  • 51. © 2019 Cray Inc. • atomic / synchronized variables: for sharing data & coordination • begin / cobegin statements: other ways of creating tasks Other Task Parallel Features 51
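Editor's note: a small, hypothetical illustration of the remaining task-parallel features listed above (atomic variables, begin, and cobegin, with a sync block to wait for completion); not taken from the slides.

  var count: atomic int;

  sync {                        // block until all tasks spawned inside have finished
    begin count.add(1);         // fire-and-forget task
    cobegin {                   // one task per statement in the block
      count.add(1);
      count.add(1);
    }
  }

  writeln(count.read());        // prints 3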
  • 52. © 2019 Cray Inc. Task Parallelism Base Language Target Machine Locality Control Chapel language concepts 52 Data Parallelism in Chapel Higher-level Chapel Domain Maps Data Parallelism
  • 53. © 2019 Cray Inc. 53 Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 config const n = 1000; var D = {1..n, 1..n}; var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl
  • 54. © 2019 Cray Inc. 54 Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 config const n = 1000; var D = {1..n, 1..n}; var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chplDomains (Index Sets)
  • 55. © 2019 Cray Inc. 55 Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 config const n = 1000; var D = {1..n, 1..n}; var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl Arrays
  • 56. © 2019 Cray Inc. 56 Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 config const n = 1000; var D = {1..n, 1..n}; var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl Data-Parallel Forall Loops
  • 57. © 2019 Cray Inc. 57 Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 config const n = 1000; var D = {1..n, 1..n}; var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl So far, this is a shared memory program Nothing refers to remote locales, explicitly or implicitly
  • 58. © 2019 Cray Inc. 58 Distributed Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 --numLocales=4 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 use CyclicDist; config const n = 1000; var D = {1..n, 1..n} dmapped Cyclic(startIdx = (1,1)); var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl Domain Maps (Map Data Parallelism to the System)
  • 59. © 2019 Cray Inc. 59 Distributed Data Parallelism, by example prompt> chpl dataParallel.chpl prompt> ./dataParallel –-n=5 --numLocales=4 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 use CyclicDist; config const n = 1000; var D = {1..n, 1..n} dmapped Cyclic(startIdx = (1,1)); var A: [D] real; forall (i,j) in D do A[i,j] = i + (j - 0.5)/n; writeln(A); dataParallel.chpl
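Editor's note: reassembled from the flattened slide text above into runnable form (using the same Cyclic distribution as on the slide):

  use CyclicDist;

  config const n = 1000;

  var D = {1..n, 1..n} dmapped Cyclic(startIdx = (1,1));
  var A: [D] real;

  forall (i,j) in D do
    A[i,j] = i + (j - 0.5)/n;

  writeln(A);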
  • 60. © 2019 Cray Inc. • Parallel Iterators and Zippering • Slicing: refer to subarrays using ranges / domains • Promotion: execute scalar functions in parallel using array arguments • Reductions: collapse arrays to scalars or subarrays • Scans: parallel prefix operations • Several Domain/Array Types Other Data Parallel Features 60
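Editor's note: a small, hypothetical illustration of several of the data-parallel features listed above (slicing, promotion, reductions, and scans); not taken from the slides.

  var A: [1..10] real;
  forall i in 1..10 do
    A[i] = i;

  writeln(A[3..5]);       // slicing: a view of elements 3 through 5
  writeln(sqrt(A));       // promotion: sqrt applied elementwise, in parallel
  writeln(+ reduce A);    // reduction: sum of all elements (55.0)
  writeln(+ scan A);      // scan: running prefix sums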
  • 61. © 2019 Cray Inc. STREAM Triad and HPCC RA Kernel, revisited 61 use …; config const m = 1000, alpha = 3.0; const ProblemSpace = {1..m} dmapped …; var A, B, C: [ProblemSpace] real; B = 2.0; C = 1.0; A = B + alpha * C; forall (_, r) in zip(Updates, RAStream()) do T[r & indexMask].xor(r);
  • 62. © 2019 Cray Inc. Arkouda: NumPy over Chapel 62
  • 63. © 2019 Cray Inc. Motivation: Say you’ve got… …a bunch of Python programmers …HPC-scale problems to solve …access to HPC systems How should you leverage these Python programmers to get your work done? Concept: Develop Python libraries that are implemented in Chapel ⇒ get performance, as with Python-over-C, but also parallelism + scalability Even Better: use NumPy interfaces to make it trivial / familiar for users Arkouda: Context 63
  • 64. © 2019 Cray Inc. Sample use from Jupyter 64
  • 65. © 2019 Cray Inc. By taking this approach, this user was able to: • interact with a distributed, running Chapel program from Python within Jupyter • run the same back-end program on… …a Cray XC …an HPE Superdome X …an Infiniband cluster …a Mac laptop • compute on TB-sized arrays in seconds • with ~1 month of effort Arkouda Accomplishments 65
  • 66. © 2019 Cray Inc. • High level — makes for less code • Great support for array operations and distributed arrays • Direct support for synchronized/atomic variables • Close to “Pythonic” (for a statically typed language) • Provides a gateway for data scientists ready to go beyond Python • Portability and Scalability • Same code runs on shared- or distributed-memory systems • “From Raspberry Pi to Cray XC” • Integrates with (distributed) numerical libraries (e.g., FFTW, FFTW-MPI) “Why Chapel?” 66
  • 67. © 2019 Cray Inc. Summary and Resources 67
  • 68. © 2019 Cray Inc. Chapel cleanly and orthogonally supports… …expression of parallelism and locality …specifying how to map computations to the system Chapel is powerful: • supports succinct, straightforward code • can result in performance that competes with (or beats) C+MPI+OpenMP Chapel is attractive to Python programmers • as a native language: similarly readable / writeable, yet scalable • as an implementation option for Python libraries Summary of this Talk 68
  • 69. © 2019 Cray Inc. Chapel Central 69 https://chapel-lang.org ● downloads ● presentations ● papers ● resources ● documentation
  • 70. © 2019 Cray Inc. https://chapel-lang.org/docs: ~200 pages, including primer examples Chapel Online Documentation 70
  • 71. © 2019 Cray Inc. read-only mailing list: chapel-announce@lists.sourceforge.net (~15 mails / year) Chapel Community 71 https://stackoverflow.com/questions/tagged/chapel https://github.com/chapel-lang/chapel/issues https://gitter.im/chapel-lang/chapel
  • 72. © 2019 Cray Inc. Chapel Social Media (no account required) 72 http://twitter.com/ChapelLanguage http://facebook.com/ChapelLanguage https://www.youtube.com/channel/UCHmm27bYjhknK5mU7ZzPGsQ/
  • 73. © 2019 Cray Inc. Chapel chapter from Programming Models for Parallel Computing • a detailed overview of Chapel’s history, motivating themes, features • published by MIT Press, November 2015 • edited by Pavan Balaji (Argonne) • chapter is also available online Suggested Reading: Chapel history and overview 73
  • 74. © 2019 Cray Inc. Suggested Reading: Recent Progress (CUG 2018) 74 paper and slides available at chapel-lang.org
  • 75. © 2019 Cray Inc. The 6th Annual Chapel Implementers and Users Workshop • sponsored by ACM SIGPLAN • held in conjunction with PLDI 2019, FCRC 2019 • June 22-23, Phoenix AZ Keynote: Anshu Dubey (Argonne) Programming Abstractions for Orchestration of HPC Scientific Computing CHIUW 2019 (https://chapel-lang.org/CHIUW2019.html) 75
  • 76. © 2019 Cray Inc. Community Talks: • Arkouda • hybrid CPU-GPU computations with Chapel • Stencil computations: getting performance for a student exercise • Chapel Graph Library Cray talks: • Chapel’s use in AI / HPO • interoperability improvements for Python, C, and Fortran • recent performance optimizations • radix sort in Chapel Also: state of the project, lightning talks, coding day CHIUW 2019 (https://chapel-lang.org/CHIUW2019.html) 76
  • 77. © 2019 Cray Inc. Chapel cleanly and orthogonally supports… …expression of parallelism and locality …specifying how to map computations to the system Chapel is powerful: • supports succinct, straightforward code • can result in performance that competes with (or beats) C+MPI+OpenMP Chapel is attractive to Python programmers • as a native language: similarly readable / writeable, yet scalable • as an implementation option for Python libraries Summary of this Talk 77
  • 78. © 2019 Cray Inc. S A F E H A R B O R S TAT E M E N T This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. 78
  • 79. THANK YOU Q U E S T I O N S ? @ChapelLanguage chapel-lang.org bradc@cray.com @cray_inc linkedin.com/company/cray-inc-/ cray.com