HPAT presentation at JuliaCon 2016
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.

HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.

1. HPAT.jl - Easy and Fast Big Data Analytics
   Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
   Parallel Computing Lab, Intel Labs (*intern from UIUC)
   JuliaCon 2016
2. HPAT overview
   High Performance Analytics Toolkit (HPAT): a compiler-based framework for big data analytics and machine learning
   • Goal: efficient large-scale analytics without sacrificing programmer productivity
     • Array-style programming
     • High performance
   • Built on ParallelAccelerator
   • Domain-specific compiler heuristics for parallelization
   • Use efficient HPC stack (e.g. MPI)
   • Bridge the enormous productivity-performance gap
   HPAT.jl: https://github.com/IntelLabs/HPAT.jl
3. Let's do Data Science
   Logistic regression example:
     function logistic_regression(iterations)
         points = …some small data…
         responses = …
         D = size(points,1)
         N = size(points,2)
         labels = reshape(responses,1,N)
         w = reshape(2*rand(D)-1,1,D)
         for i in 1:iterations
             w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
         end
         return w
     end
   (image credit: gepsoft.com)
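   A runnable version of this slide's sketch (my own, not from the deck): the elided "…some small data…" is replaced by synthetic inputs, the sizes D and N are arbitrary, and the kernel is written with current broadcast-dot syntax rather than the 2016 vectorized syntax above.
     # Sketch: sequential logistic regression on synthetic data so the slide's
     # example can actually be run and timed on one machine.
     function logistic_regression_synthetic(iterations)
         D = 10                        # number of features (arbitrary)
         N = 100_000                   # number of samples (arbitrary)
         points = rand(D, N)           # one sample per column (column-major)
         responses = rand([-1.0, 1.0], N)   # synthetic +/-1 labels
         labels = reshape(responses, 1, N)
         w = reshape(2 * rand(D) .- 1, 1, D)
         for i in 1:iterations
             w -= ((1 ./ (1 .+ exp.(-labels .* (w * points))) .- 1) .* labels) * points'
         end
         return w
     end

     @time logistic_regression_synthetic(100)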
4. What about large data?
   • Challenges:
     • Long execution time
     • Data doesn't fit in memory
   • Solution:
     • Parallelism: cluster or cloud
   • How:
     • MPI/C++ ("gold standard"), MPI/Julia, Spark (library)
     • HPAT ("smart compiler")
   (image credit: udel.edu)
5. MPI/C++ ("gold standard")
   • Message Passing Interface (MPI)
   • Pros:
     • Best performance!
   • Cons:
     • Need to understand parallelism, MPI, parallel I/O, C++
     • High effort, tedious, error-prone, not readable…
   Example code (not readable):
     herr_t ret;
     // set up file access property list with parallel I/O access
     hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS); assert(plist_id != -1);
     // set parallel access with communicator
     ret = H5Pset_fapl_mpio(plist_id, comm, info); assert(ret != -1);
     // open file
     file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id); assert(file_id != -1);
     ret = H5Pclose(plist_id); assert(ret != -1);
     // open dataset
     dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT); assert(dataset_id != -1);
     hid_t space_id = H5Dget_space(dataset_id); assert(space_id != -1);
     int num_dims = 0;
     num_dims = H5Sget_simple_extent_ndims(space_id); assert(num_dims == 1);
     /* get data dimension info */
     hsize_t sptprice_size;
     H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
     hsize_t my_sptprice_size = sptprice_size/mpi_nprocs;
     hsize_t my_start = my_sptprice_size*mpi_rank;
     my_sptprice = (double*)malloc(my_sptprice_size*sizeof(double));
     // create a file dataspace independently
     hid_t my_dataspace = H5Dget_space(dataset_id); assert(my_dataspace != -1);
     // stride and block are NULL for contiguous hyperslab
     ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL); assert(ret != -1);
     ...
6. MPI/Julia
   • MPI.jl
   • Pros:
     • Less effort than C++
   • Cons:
     • Still need to understand parallelism, MPI
     • Parallel I/O
     • Needs high performance Julia
     • Infrastructure challenges
   Example code:
     function main()
         MPI.Init()
         …
         MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, MPI.COMM_WORLD)
         …
         MPI.Finalize()
     end
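   For reference, a minimal self-contained MPI.jl program built around the Allreduce pattern sketched on this slide. This is my own sketch, not HPAT code, and MPI.jl's function signatures have shifted across versions, so treat it as illustrative.
     # Each rank produces a partial array; Allreduce! combines them on all ranks.
     using MPI

     function main()
         MPI.Init()
         comm = MPI.COMM_WORLD
         rank = MPI.Comm_rank(comm)

         send_arr = fill(Float64(rank), 4)   # stand-in for a locally computed partial result
         recv_arr = zeros(Float64, 4)

         MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, comm)   # element-wise sum across ranks

         rank == 0 && println("reduced: ", recv_arr)
         MPI.Finalize()
     end

     main()
   Launched the usual way, e.g. mpirun -np 4 julia allreduce_sketch.jl (file name hypothetical).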
7. Apache Spark
   • Map/reduce library
   • Master-executer model
   • Pros:
     • Easier than MPI/C++
     • Lots of "system" features
   • Cons:
     • Development effort
     • Very slow
       • 100x slower than MPI/C++
       • Host language overheads, loses locality, high library overheads, etc.
   (image credit: Infoobjects.com)
   Example code (Python):
     if __name__ == "__main__":
         sc = SparkContext(appName="PythonLR")
         points = sc.textFile(file).mapPartitions(readPointBatch).cache()
         w = 2 * np.random.ranf(size=D) - 1
         def gradient(matrix, w):
             Y = matrix[:, 0]
             X = matrix[:, 1:]
             return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
         def add(x, y):
             x += y
             return x
         for i in range(iterations):
             w -= points.map(lambda m: gradient(m, w)).reduce(add)
8. HPAT ("smart compiler")
   • Julia code → MPI/C++ "gold standard"
   • "Smart parallelizing compiler" doesn't exist, but…
   • Observations:
     • Array code is implicitly parallel
     • Parallelism is simple for data analytics
       • map/reduce pattern
       • 1D decomposition, allreduce communication
   HPAT.jl: https://github.com/IntelLabs/HPAT.jl
   (diagram: points and labels arrays split by 1D decomposition; w replicated on every node)
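   A small sketch (mine) of what the 1D decomposition means in practice: each of nprocs ranks owns a contiguous block of columns (samples), mirroring the mystart/myend arithmetic in the generated code on slide 10. The remainder handling on the last rank is an assumption of this sketch.
     # Compute the 1-based column range owned by a given rank.
     function my_column_range(rank::Int, nprocs::Int, N::Int)
         chunk = div(N, nprocs)
         mystart = rank * chunk + 1                            # 1-based, unlike the C code
         myend = rank == nprocs - 1 ? N : (rank + 1) * chunk   # last rank takes the remainder
         return mystart:myend
     end

     # Example: 10 samples over 4 ranks -> 1:2, 3:4, 5:6, 7:10
     for r in 0:3
         println("rank $r owns columns ", my_column_range(r, 4, 10))
     end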
9. Logistic Regression (HPAT)
     using HPAT
     @acc hpat function logistic_regression(iterations, file)
         points = DataSource(Matrix{Float64},HDF5,"/points", file)
         responses = DataSource(Vector{Float64},HDF5,"/responses",file)
         D = size(points,1)
         N = size(points,2)
         labels = reshape(responses,1,N)
         w = reshape(2*rand(D)-1,1,D)
         for i in 1:iterations
             w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
         end
         return w
     end
     weights = logistic_regression(100,"mydata.hdf5")

     $ mpirun -np 64 julia logistic_regression.jl
   https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
   (slide callouts: Parallel I/O; 95x speedup over Spark)
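   The example expects an HDF5 file with /points and /responses datasets. A sketch (not from the slides) of producing such a file with HDF5.jl's h5write, using synthetic data; the sizes are arbitrary and the file name matches the call above.
     using HDF5

     D, N = 10, 100_000
     points = rand(D, N)                 # one sample per column, column-major on disk
     responses = rand([-1.0, 1.0], N)    # synthetic +/-1 labels

     h5write("mydata.hdf5", "/points", points)
     h5write("mydata.hdf5", "/responses", responses)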
10. Logistic Regression (generated)
      double* logistic_regression(int64_t iterations, char* file)
      {
          int mpi_rank, mpi_nprocs;
          MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
          MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
          // initialization, partitioning, allocation
          int mystart = mpi_rank*(N/mpi_nprocs);
          int myend = (mpi_rank+1)*(N/mpi_nprocs);
          double *points = (double*)malloc((myend-mystart)*D*sizeof(double));
          …
          // parallel I/O
          ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, …);
          ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, …);
          // computation
          for(i = mystart; i < myend; i++) {
              …
              w_local = …
          }
          // parameter synchronization
          MPI_Allreduce(w, w_local, D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          return w;
      }
    "Bare-metal" MPI/C++ code with near-zero overhead!
11. HPAT usage
    • Dependencies:
      • ParallelAccelerator, Parallel HDF5, MPI
    • "@acc hpat" function annotation
    • Use matrix/vector operations, comprehensions
      • ParallelAccelerator operations
    • No "bad" for loops
    • Column-major matrices
    • All of program inside HPAT
    • I/O using DataSource
    HPAT.jl: https://github.com/IntelLabs/HPAT.jl
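   An illustrative contrast (my example, not from the deck) of the coding-style point: a whole-array reduction that the ParallelAccelerator/HPAT analysis can reason about versus an element-by-element loop over a shared accumulator, which is the kind of "bad" loop that is much harder to parallelize.
     # Array style: the per-feature reduction over samples is a single visible operation.
     feature_sums_good(points) = sum(points, dims=2)

     # "Bad" loop style: scalar indexing and sequential accumulation into s.
     function feature_sums_bad(points)
         D, N = size(points)
         s = zeros(D)
         for j in 1:N, i in 1:D
             s[i] += points[i, j]
         end
         return s
     end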
12. HPAT limitations
    • HPAT can fail to parallelize!
      • Limitations in compiler analysis
      • Needs good coding style
      • Fallback: explicit map/reduce, @parfor code
    • Only map/reduce parallel pattern supported
      • Data analytics, machine learning, optimization etc.
      • Others like stencils (PDEs) not supported yet
    • No sparse matrices yet
    • HDF5 and text file format
    HPAT.jl: https://github.com/IntelLabs/HPAT.jl
13. Array Distribution Inference
    Pseudocode of the heuristic:
      Init(state)  # assume all arrays are partitioned
      while isChanged(state)
          inferArrayDistribution(state, node)
      end

      function inferArrayDistribution(state, node)
          if isAssignment(node)
              # lhs and rhs are sequential if either is sequential
              seq = isSeq(state, lhs) || isSeq(state, rhs)
              isSeq(state, lhs) = isSeq(state, rhs) = seq
          elseif isGEMM(node)
              # e.g. w = labels*points' - shared parameter synchronization heuristic
              isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
          elseif isHPATcall(node)
              handleHPATcall(node)
          else
              # unknown call, assume all arrays are sequential
              isSeq(state, nodeArrs) = true
          end
      end
    (diagram: points and labels partitioned, w replicated)
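   A runnable toy version (my sketch) of the fixed-point idea above: every array starts out assumed partitioned and is demoted to sequential until nothing changes. The GEMM and HPAT-call cases are omitted, and nodes are simplified to (kind, arrays) pairs; HPAT's real pass works on the compiler IR.
     function infer_distribution(arrays::Vector{Symbol}, nodes)
         seq = Dict(a => false for a in arrays)   # false = partitioned (1D), true = sequential
         changed = true
         while changed
             changed = false
             for (kind, args) in nodes
                 if kind == :assign                # lhs and rhs sequential if either is
                     l, r = args
                     s = seq[l] || seq[r]
                     if s != seq[l] || s != seq[r]
                         seq[l] = seq[r] = s
                         changed = true
                     end
                 elseif kind == :unknown_call      # conservatively mark all involved arrays sequential
                     for a in args
                         changed |= !seq[a]
                         seq[a] = true
                     end
                 end
             end
         end
         return seq
     end

     # Example: w comes from an unknown call and is later assigned into w2,
     # so both end up sequential while points and labels stay partitioned.
     nodes = [(:unknown_call, [:w]), (:assign, (:w2, :w))]
     infer_distribution([:points, :labels, :w, :w2], nodes)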
14. HPAT Compiler Pipeline
    (pipeline diagram) Julia source → Julia Compiler → Macro-Pass → Domain-Pass → Domain-IR → Parallel-IR → Distributed-Pass → HPAT Code Generation (MPI) → CGen → MPI/C++ source → Backend Compiler (ICC/GCC)
15. HPAT Compiler Pipeline (passes)
    • MacroPass
      • "desugar" extensions like DataSource
    • DomainPass
      • Generate variables, allocations, and function calls
      • Enable ParallelAccelerator pipeline
    • DistributedPass
      • Infer partitioned vs. sequential arrays
      • Divide allocations; parallelize I/O and computation
      • "hook" distributed-memory libraries
    • Backend code generation
      • MPI code extension for CGen
16. HPAT vs. Spark
    Execution time (s) on Cori at NERSC/LBL, 64 nodes (2048 cores):
      Benchmark             Spark   HPAT    Speedup
      1D sum                47      1.55    30x
      1D sum filter         43      0.69    62x
      Monte Carlo Pi        84      0.05    1680x
      Logistic regression   1061    11.13   95x
      K-Means               182     7.95    23x
17. Strong Scaling - Logistic Regression
    Execution time (s), Spark vs. HPAT:
      Nodes (cores)   Spark   HPAT
      64 (2k)         1061    11.13
      128 (4k)        800     5.89
      256 (8k)        634     4.11
    (slide callouts: 95x speedup at 64 nodes, 154x at 256 nodes)
18. Machine Learning Libraries
    Intel® DAAL as backend (not readable):
      ...
      kmeans::init::Distributed<step1Local,double,kmeans::init::randomDense>
          localInit(nClusters, nBlocks * nVectorsInBlock, rankId * nVectorsInBlock);
      localInit.input.set(kmeans::init::data, dataSource.getNumericTable());
      localInit.compute();
      services::SharedPtr<byte> serializedData;
      InputDataArchive dataArch;
      localInit.getPartialResult()->serialize(dataArch);
      size_t perNodeArchLength = dataArch.getSizeOfArchive();
      if (rankId == mpi_root) {
          serializedData = services::SharedPtr<byte>(new byte[perNodeArchLength * nBlocks]);
      }
      byte *nodeResults = new byte[perNodeArchLength];
      dataArch.copyArchiveToArray(nodeResults, perNodeArchLength);
      MPI_Gather(nodeResults, perNodeArchLength, MPI_CHAR, serializedData.get(),
                 perNodeArchLength, MPI_CHAR, mpi_root, MPI_COMM_WORLD);
      ...
    The HPAT version (DAAL is called under the hood):
      using HPAT
      @acc hpat function calcKmeans(k, file)
          points = DataSource(Matrix{Float64},HDF5,"/points", file)
          clusters = HPAT.Kmeans(points, k)
          return clusters
      end
19. Building HPAT in Julia
    • Compiler development experience
    • The good:
      • Built-in type inference
      • Introspection/full control
    • The bad:
      • No detailed AST definition
      • Julia compiler surprises
      • Long HPAT/ParallelAccelerator compilation time!
        • Type inference every time
20. Ongoing HPAT development
    • Structured data processing without SQL!
      • Complex analytics all in array syntax
      • Inspired by TPCx-BB examples
    • Array syntax for table operations
      • Join, filter, aggregate
      • Similar to DataFrames.jl
    • Interesting compiler challenges
      • Optimize general AST instead of SQL trees
    • Other use cases
      • 2D decomposition
    Example:
      customer_i_class = aggregate(sale_items, :ss_customer_sk,
          :ss_item_count = length(:ss_item_sk),
          :id1 = sum(:i_class_id==1),
          :id2 = sum(:i_class_id==2),
          :id3 = sum(:i_class_id==3),
          :id4 = sum(:i_class_id==4))
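   A plain-Julia sketch (my illustration, not HPAT's API) of what this aggregate computes: group rows by customer, count items, and sum the per-class indicators. Column vectors stand in for the sale_items table.
     function customer_i_class_plain(ss_customer_sk, ss_item_sk, i_class_id)
         out = Dict{Int,Vector{Int}}()   # customer => [item_count, id1, id2, id3, id4]
         for k in eachindex(ss_customer_sk)
             acc = get!(out, ss_customer_sk[k], zeros(Int, 5))
             acc[1] += 1                          # ss_item_count = length(:ss_item_sk)
             c = i_class_id[k]
             1 <= c <= 4 && (acc[c + 1] += 1)     # idN = sum(:i_class_id == N)
         end
         return out
     end

     customer_i_class_plain([1, 1, 2], [10, 11, 12], [1, 3, 1])
     # => Dict(1 => [2, 1, 0, 1, 0], 2 => [1, 1, 0, 0, 0])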
21. Summary
    • High Performance Analytics Toolkit (HPAT) provides scripting abstractions and "bare-metal" performance
      • Matrix/vector operations, extension for parallel I/O
      • Domain-specific compiler techniques
      • Generates efficient MPI/C++
      • Uses existing HPC libraries
    • Much easier and faster than alternatives
    • Get involved!
      • Any contributions welcome
      • Need more and more use cases
    HPAT.jl: https://github.com/IntelLabs/HPAT.jl
22. Backup
23. Big Data Analytics is Slow
    • Goal: gain new insight from large datasets by domain experts
      • Productivity is 1st priority
      • Scripting languages most common, fast development
      • MPI/C++ not acceptable
    • Apache Hadoop and Spark dominant
      • Intuitive user interface (MapReduce, Python)
      • Master-executer library approach
    • Library approach is slow
      • Loses locality, high overheads
      • Orders of magnitude slower than handwritten MPI/C++
    (image credits: Infoobjects.com, http://hadoop.apache.org/)
    References:
      F. McSherry, et al. "Scalability! But at what COST?", HotOS 2015.
      K. Brown, et al. "Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns", CGO 2016.
25. DistributedPass: parallelization
    • 1D partitioning of data per node
      • Domain-specific heuristic for machine learning, analytics
      • Not valid for other domains!
    • Column partitioning for matrices
      • Julia is column major
    • Needs good coding style by user
    • 1D partitioning of parfor iterations per node
    • Handle distributed-memory libraries
      • Generate necessary input/output transformations
      • Intel® Data Analytics Acceleration Library (Intel® DAAL)
      • Machine learning algorithms similar to Spark's MLlib
26. Example: Distributed Data Source Transformation Flow
    Julia source:
      points = DataSource(Matrix{Float64},HDF5,"/points","data.hdf5")
    After Macro-Pass (enable Julia type inference):
      points::Matrix{Float64} = HPAT_h5_source("/points","data.hdf5")
    After Domain-Pass (enable ParallelAccelerator):
      h5_size1, h5_size2 = HPAT_h5_sizes("/points","data.hdf5")
      points = alloc(h5_size1, h5_size2)
      HPAT_h5_read(points, "/points","data.hdf5")
    After Distributed-Pass (parallelize):
      points = alloc(h5_size1, h5size_2/nprocs)
      HPAT_h5_read(points,"/points","data.hdf5",h5_size1,h5size_2/nprocs)
    HPAT-CGen (backend code):
      H5Sselect_hyperslab(h5_size1, h5size_2/nprocs,…)
      H5Dread(points,"/points","data.hdf5",…)
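   A sketch (my illustration, not HPAT's actual implementation) of the kind of rewriting the Macro-Pass step performs: walk the expression tree and replace the DataSource(...) call with the HPAT_h5_source(...) form shown above. The element type and format arguments are simply ignored here.
     desugar(x) = x   # leave symbols, literals, etc. unchanged
     function desugar(ex::Expr)
         if ex.head == :call && ex.args[1] == :DataSource
             typ, fmt, path, file = ex.args[2:5]   # typ and fmt unused in this sketch
             return :(HPAT_h5_source($path, $file))
         end
         return Expr(ex.head, map(desugar, ex.args)...)
     end

     desugar(:(points = DataSource(Matrix{Float64}, HDF5, "/points", "data.hdf5")))
     # => :(points = HPAT_h5_source("/points", "data.hdf5"))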
27. Backend Code Complexity
    • Parallel I/O example (HDF5, MPI/C++) - hard to write manually!
      herr_t ret;
      // set up file access property list with parallel I/O access
      hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS); assert(plist_id != -1);
      // set parallel access with communicator
      ret = H5Pset_fapl_mpio(plist_id, comm, info); assert(ret != -1);
      // open file
      file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id); assert(file_id != -1);
      ret = H5Pclose(plist_id); assert(ret != -1);
      // open dataset
      dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT); assert(dataset_id != -1);
      hid_t space_id = H5Dget_space(dataset_id); assert(space_id != -1);
      int num_dims = 0;
      num_dims = H5Sget_simple_extent_ndims(space_id); assert(num_dims == 1);
      /* get data dimension info */
      hsize_t sptprice_size;
      H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
      hsize_t my_sptprice_size = sptprice_size/mpi_nprocs;
      hsize_t my_start = my_sptprice_size*mpi_rank;
      my_sptprice = (double*)malloc(my_sptprice_size*sizeof(double));
      // create a file dataspace independently
      hid_t my_dataspace = H5Dget_space(dataset_id); assert(my_dataspace != -1);
      // stride and block are NULL for contiguous hyperslab
      ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL); assert(ret != -1);
      /* create a memory dataspace independently */
      hid_t mem_dataspace = H5Screate_simple(1, &my_sptprice_size, NULL); assert(mem_dataspace != -1);
      /* set up the collective transfer properties list */
      hid_t xfer_plist = H5Pcreate(H5P_DATASET_XFER); assert(xfer_plist != -1);
      ret = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE); assert(ret != -1);
      /* read data collectively */
      ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace, my_dataspace, xfer_plist, my_sptprice); assert(ret != -1);
      ...
28. Libraries
    Execution time (s), Spark MLlib vs. HPAT-DAAL:
      Benchmark                 Spark-MLlib   HPAT-DAAL   Speedup
      K-Means (lib)             59            21          2.8x
      Linear regression (lib)   176           26          6.7x
      Naïve Bayes (lib)         43            23          1.9x
29. Benchmarks
    • Cori at LBL/NERSC
      • Dual Haswell nodes
      • Cray Aries (Dragonfly) network
      • 64 nodes (2048 cores) used
    • Spark 1.6.0 (default Cori installation)
    • Benchmarks
      • 1D_sum: sums 8.5 billion element vector from file
      • Pi: 1 billion random points
      • Logistic regression: 2 billion samples, 10 features, single precision
      • K-Means: 320 million 20-feature double-precision points, 10 iterations, 5 centers
30. Logistic Regression (Spark)
      D = 10  # Number of dimensions
      if __name__ == "__main__":
          sc = SparkContext(appName="PythonLR")
          points = sc.textFile(file).mapPartitions(readPointBatch).cache()
          w = 2 * np.random.ranf(size=D) - 1
          def gradient(matrix, w):
              Y = matrix[:, 0]   # point labels (first column of input file)
              X = matrix[:, 1:]  # point coordinates
              return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
          def add(x, y):
              x += y
              return x
          for i in range(iterations):
              w -= points.map(lambda m: gradient(m, w)).reduce(add)
    https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
    (slide callouts: scheduling, TCP/IP overheads; Python overheads)
31. Logistic Regression (HPAT)
    Backup copy of the code on slide 9, with the "95x speedup!" callout.
    https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
32. Monte Carlo Pi (Spark)
      from pyspark import SparkContext
      if __name__ == "__main__":
          sc = SparkContext(appName="PythonPi")
          n = 100000 * partitions
          def f(_):
              x = random() * 2 - 1
              y = random() * 2 - 1
              return 1 if x ** 2 + y ** 2 < 1 else 0
          def add(x, y):
              x += y
              return x
          count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
          print("Pi is roughly %f" % (4.0 * count / n))
          sc.stop()
    https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
    (slide callouts: scheduling overheads; extra array map reduce)
33. Monte Carlo Pi (HPAT)
    Generated MPI/C++:
      double calcPi(int64_t N) {
          int mpi_rank, mpi_nprocs;
          MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
          MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
          int mystart = mpi_rank*(N/mpi_nprocs);
          int myend = (mpi_rank+1)*(N/mpi_nprocs);
          for(i = mystart; i < myend; i++) {
              x = rand(..);
              y = rand(..);
              sum_local += …
          }
          MPI_Reduce(…);
          return out;
      }
    HPAT source:
      using HPAT
      @acc hpat function calcPi(n)
          x = rand(n) .* 2.0 .- 1.0
          y = rand(n) .* 2.0 .- 1.0
          return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
      end
      myPi = calcPi(10^9)

      $ mpirun -np 64 julia pi.jl
    https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
    (slide callouts: 1600x speedup!; computation done in registers)
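   For comparison, the same computation in plain serial Julia (my sketch): this is the array-style form that the @acc pipeline turns into the partitioned loop plus MPI_Reduce shown above, here sized down so it runs on a single machine.
     function calc_pi_serial(n)
         x = rand(n) .* 2.0 .- 1.0
         y = rand(n) .* 2.0 .- 1.0
         return 4.0 * sum(x.^2 .+ y.^2 .< 1.0) / n
     end

     calc_pi_serial(10^7)   # roughly 3.14 for a modest n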