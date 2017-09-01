User-Defined Functions (UDFs) allow application programmers to specify analysis operations on data while leaving data management and other non-trivial tasks to the system. This general approach is at the heart of modern Big Data systems such as MapReduce/Spark and SciDB. However, a wide variety of common scientific data operations, such as computing the moving average of a time series or the vorticity of a fluid flow, are hard to express and slow to execute with these systems. In this talk, we will introduce ArrayUDF (https://bitbucket.org/arrayudf/arrayudf), a new Big Data system for scientific data sets, especially multi-dimensional arrays. ArrayUDF lets users express UDFs for scientific data analysis flexibly by exploiting a property these operations share: structural locality. ArrayUDF executes UDFs directly on arrays stored in files, such as HDF5 files, without any data-loading overhead. ArrayUDF's design and implementation considerations for parallel data processing on large-scale HPC systems will also be introduced. Performance tests on Edison at NERSC show that ArrayUDF is around 2000X faster than Spark at processing large scientific datasets.
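To make the idea of structural locality concrete, here is a minimal Python sketch of a moving average written as a UDF that reads only a cell and its immediate neighbors. This is not ArrayUDF's actual API: the names `Stencil`, `apply_udf`, and `moving_average`, and the choice to clamp at the array boundary, are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence


class Stencil:
    """Hypothetical helper: one cell of a 1-D array plus neighbor access by relative offset."""

    def __init__(self, data: Sequence[float], index: int) -> None:
        self._data = data
        self._index = index

    def __call__(self, offset: int = 0) -> float:
        # Clamp out-of-range neighbors to the nearest edge cell
        # (a boundary policy assumed for this sketch only).
        pos = min(max(self._index + offset, 0), len(self._data) - 1)
        return self._data[pos]


def apply_udf(data: Sequence[float], udf: Callable[[Stencil], float]) -> List[float]:
    """Run the user-defined function once per cell and collect the results."""
    return [udf(Stencil(data, i)) for i in range(len(data))]


def moving_average(s: Stencil) -> float:
    """Three-point moving average: the UDF touches only nearby cells."""
    return (s(-1) + s(0) + s(1)) / 3.0


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = apply_udf(series, moving_average)
```

Because the UDF only names relative offsets, a system is free to decide how the array is partitioned and how boundary cells are exchanged between processes; the sketch above runs serially and ignores both concerns.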



