    • ArrayUDF: User-Defined Scientific Data Analysis on Arrays

    1. CS/NERSC Data Seminar, September 1, 2017. ArrayUDF: User-Defined Scientific Data Analysis on Arrays. Bin Dong (1), Kesheng Wu (1), Surendra Byna (1), Jialin Liu (1), Weijie Zhao (2), Florin Rusu (1,2). (1) LBNL, Berkeley, CA; (2) UC Merced, Merced, CA.
    2. Clarion call for large-scale data analysis in modern scientific activities. Example: scientific projects on supernovae, dark matter/energy, etc. Data source: Rick White, J. Hart, R. Cutri, Ian Foster, C. J. Grillmair, etc. SIGMOD’14, SIGMOD’16, SIGMOD’17.
    3. Q1: How many data analysis operations are being or will be developed?
    4. Q1: How many data analysis operations are being or will be developed? Implication from popular data analysis languages (average growth): Python modules, 72/day; R packages, 10/day; Java packages, 108/day. Data from http://www.modulecounts.com/ on Aug 30, 2017. Answer: a large population.
    5. Q1: How many data analysis operations are being or will be developed? Answer: a large population. Q2: What are the functions of these data analysis operations?
    6. Q1: How many data analysis operations are being or will be developed? Answer: a large population. Q2: What are the functions of these data analysis operations? Answer: variety.
    7. Two common methods to develop data analysis operations with large population and variety. (1) Customized solutions: for each operation P, develop P’s own data management, expression execution, and other components (parallelism, communication, cache, etc.). Pro: diverse. Cons: redundant; developers may lack expertise in the underlying systems to tune performance. (2) User-defined functions (UDF): each operation is written as an expression against a UDF API, on top of one shared system that provides data management, a generic execution engine, and other components (parallelism, communication, cache, etc.). Pros: diverse; one single, shared system; professionally tuned.
    8. UDF is at the heart of modern big data systems. Examples: MapReduce in Apache Hadoop and Spark. Pipeline: input data; map() (UDF to generate (key, value) pairs); shuffle/sort; reduce() (UDF to merge (key, values) pairs); output data.
    9. MapReduce is not an optimal fit for scientific data analysis. Reason 1: most scientific data are multi-dimensional arrays. Converting an array to (key, value) pairs is expensive because the coordinates must be handled explicitly. Picture credits: Kyle Hemes, Peter Nugent, Suren Byna, etc.
    10. MapReduce is not an optimal fit for scientific data analysis (continued). Reason 1: most scientific data are multi-dimensional arrays; converting an array to (key, value) pairs is expensive. Reason 2: most scientific data analysis operations have the structural locality property, i.e., the analysis operation on a single cell accesses its neighborhood cells. Examples: moving average; 2D Poisson equation solver (discrete). Map deals with a single element at a time. Reduce requires duplicating each cell for all of its neighborhood cells (data volume grows by roughly the number of neighbors), and reduce only happens after an expensive shuffle.
    11. ArrayUDF: user-defined scientific data analysis on arrays. • Stencil-based user-defined function API: structural-locality-aware array operations. • Native multidimensional array data model: in-situ data processing in scientific data formats, e.g., HDF5. • Optimal and automatic chunking and ghost zone handling method: fast processing of large arrays in a parallel and out-of-core manner. Figure labels: storage system, ArrayUDF, processing elements PE0 to PE3, chunk, ghost zone.
    12. Stencil-based UDF. • A stencil (S) is a structural representation of a set of neighborhood cells. - S has a center where computing happens. - The size |S| is not fixed. - Notation: the set member $s_{d_1,d_2,\cdots}$ stands for the cell at offset $(d_1,d_2,\cdots)$ from the center point $(i,j,\cdots)$. 2D example: $s_{-1,-1}$, $s_{-1,0}$, $s_{-1,1}$, $s_{0,-1}$, $s_{0,0}$, $s_{0,1}$, $s_{1,-1}$, $s_{1,0}$, $s_{1,1}$. Benefits: materialized structural locality; flexible UDF expression by manipulating each neighborhood cell independently.
    13. Stencil-based UDF (continued). • $f$ is an arbitrary user-defined function, $f(S_{i,j,\cdots}) \rightarrow c'_{i,j,\cdots}$, applied to the cells of the input array $A$ to produce the output array $A'$. - $|S| = 1$: a user-defined function of a single cell, i.e., map in MapReduce. - $|S| > 1$: a user-defined aggregation of a set of cells, i.e., reduce in MapReduce.
    14. Examples of using ArrayUDF. Example 1: moving average in time series data (global temperature trend filtered by a moving average at 60 years’ interval, from 1908 to 2008). Three steps using ArrayUDF. Step 1, initialize data: Array T(“data location pointer”). Step 2, define the operation on a Stencil: Tem_avg(Stencil t): return (t(-30) + … + t(30))/60. Step 3, run and get the result T’: T.Apply(Tem_avg, T’).
    15. Examples of using ArrayUDF (continued). Example 2: vorticity computation in fluid flow (applications: modeling renewable energy, combustion engines). Three steps using ArrayUDF. Step 1, initialize data (2D example): Array V_X(“data location pointer”); Array V_Y(“data location pointer”). Step 2, define the operations on a Stencil: VC_X(Stencil u): return u(0,1) - u(0,-1); VC_Y(Stencil v): return v(1,0) - v(-1,0). Step 3, run and get the result: V_X.Apply(VC_X, V_X’); V_Y.Apply(VC_Y, V_Y’); V_X’ + V_Y’ as vorticity. Picture credits: LANL, Frank Fritz Michael Milthaler, etc.
    16. Optimized performance of ArrayUDF. $T_{\text{overall}} = T_{\text{I/O}} + T_{\text{computing}} + T_{\text{communication}}$, where $T_{\text{overall}}$ is the overall time to run a data analysis operation, $T_{\text{I/O}}$ is the time to read/write data, $T_{\text{computing}}$ is the time to execute the operation expression, and $T_{\text{communication}}$ is the time spent on communication. In ArrayUDF: $T_{\text{I/O}}$ is minimized, $T_{\text{computing}}$ is constant, and $T_{\text{communication}} = 0$ (avoided).
    17. ArrayUDF minimizes I/O cost via smart chunking. • Factors considered in chunking: physical layout, logical shape, size, etc. • Two chunking strategies: - Layout unknown, i.e., the average case over all possible layouts: select square-shaped chunks to minimize ghost cells per chunk. - Layout known in advance, i.e., row-major, the most popular one: select contiguous chunking to maximize I/O on contiguous cells, including ghost cells. Figure: a square-shaped chunk has 12 ghost cells, while a non-square chunk has 20; a second panel shows chunks laid out in row-major order. See the theoretical analysis in the ArrayUDF paper at HPDC’17.
    18. ArrayUDF dynamically builds ghost zones to avoid communication. • What is a ghost zone, and why? - A ghost zone is the set of extra cells surrounding a chunk. - Motivated by structural locality. • When is the ghost zone built? - When a chunk is read from disk into memory. • How is the size of the ghost zone determined? User-defined, or by trial run: execute the UDF code on a special Stencil instance to collect the offsets used by the UDF. Size of ghost zone = maximum of the collected offsets.
    19. Evaluations. • Hardware: - Edison, a Cray XC30 supercomputer at NERSC. - 5576 computing nodes, 24 cores/node, 64GB DDR3 memory. • Software: - ArrayUDF. - RasDaMan 9.5 (sequential version). - Spark 1.5.0. - EXTASCID (hand-optimized version). - SciDB 16.9. - Hand-optimized C/C++ code. • Workloads: - Two synthetic data sets (i.e., 2D and 3D) for micro benchmarks: window operators, chunking strategy, trial run, etc. - Four real scientific data sets (i.e., S3D, MSI, VPIC, CoRTAD): overall performance tests with the generic UDF interface.
    20. Comparison with peer systems using standard “window” operators. • “window” comes from SciDB and RasDaMan, where an operator is applied to all window members uniformly. Test operation: average on a 2x2 window for 2D and a 2x2x2 window for 3D. • ArrayUDF has close performance to hand-optimized code. • ArrayUDF is as much as 384X faster than peer systems.
    21. Comparison with Spark in real scientific data analysis with the generic UDF interface. Data sets (operation, data size, number of local cells used by the UDF): S3D (vorticity computation, 301GB, 2 local cells/op.); MSI (Laplacian operator, 21GB, 4 local cells/op.); VPIC (trilinear interpolation, 36GB, 8 local cells/op.); CoRTAD (moving average, 225GB, 4 local cells/op.). Spark runs out of memory with large data sizes and with more local cells. We observed that ArrayUDF is as much as 2070X faster than Spark.
    22. ArrayUDF at NERSC. Commands: module load arrayudf/1.0; module show arrayudf/1.0. Examples: https://bitbucket.org/arrayudf/
    23. Conclusions. • ArrayUDF: user-defined scientific data analysis on arrays. - Stencil-based UDF API for structural-locality-aware operations. - Native array model and in-situ array processing in HDF5, etc. - Automatic and optimal chunking and ghost zone methods for parallel or out-of-core array processing. • ArrayUDF provides close performance to hand-optimized code. • ArrayUDF is as much as 2070X faster than Spark. • ArrayUDF is an easy-to-use data analysis system. • ArrayUDF source code: https://bitbucket.org/arrayudf/ • Future work: - Python and other language interfaces (done). - More in-situ formats: NetCDF, PnetCDF, ADIOS, etc.
    24. Acknowledgments. • Nicholas Chaimov from University of Oregon for suggestions to set up Spark on Edison at NERSC. • Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, support for the SDS project and a DOE Career award (Program manager: Dr. Lucy Nowell) under contract number DE-AC02-05CH11231. • National Energy Research Scientific Computing Center.
    25. Thanks. Bin Dong, dbin@lbl.gov, http://crd.lbl.gov//dongbin
    26. Backup Slides
    27. Stencil-based computing model vs. others. Per system (input; output; UDF): SQL DBMS (tuple t; tuple t’; t’=f(t)). SciDB (cell c; cell c’; c’=f(c)). MapReduce (KeyValue kv; KeyValue kv’’; kv’=Map(kv), kv’’=Reduce(kv’1, kv’2, …)). ArrayUDF (Stencil s; cell c’; c’=f(s)). vs. MapReduce: ArrayUDF generalizes map and reduce as a single operation. vs. SciDB: SciDB has ‘window’, which is similar to Stencil, but the ‘window’ usually applies an operator uniformly to all cells involved. In ArrayUDF, a different operator can be applied to each cell within a stencil.
    28. Chunking strategy evaluation. 2D dataset (100000, 100000); square chunk (1K, 1K). • Square chunking (for average cases): minimizes the number of ghost cells to reduce I/O cost. • Contiguous chunking (for row-major layout): maximizes contiguous disk reads. The ghost zone has negligible impact.
    29. Trial-run overhead. • Detects the ghost zone size automatically. • Runs the UDF on a single Stencil, but the overhead grows as the UDF accesses more neighborhood cells (figure unit: microsecond). ≈ 1 ms when 256 cells are used in the UDF.
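
The stencil-based UDF model from the slides on Stencil-based UDF and the moving-average example can be sketched in a few lines of Python. This is only an illustration of the idea, not the ArrayUDF API: the names `Stencil`, `apply`, and `moving_avg` are hypothetical, the array is a plain 1D list, and clamping at the array boundary is an assumption made so the example runs on a tiny input. The sketch divides by the number of cells actually read rather than by a fixed constant.

```python
class Stencil:
    """Hypothetical stand-in for a stencil: a view of `data` centered at `center`."""

    def __init__(self, data, center):
        self._data = data
        self._center = center

    def __call__(self, offset):
        # Clamp at the array boundary; this boundary rule is an assumption.
        i = min(max(self._center + offset, 0), len(self._data) - 1)
        return self._data[i]


def apply(data, udf):
    """Run `udf` once per cell and collect the results into a new list."""
    return [udf(Stencil(data, i)) for i in range(len(data))]


def moving_avg(s, half_width=1):
    # Mean over the cells at offsets -half_width .. +half_width.
    cells = [s(d) for d in range(-half_width, half_width + 1)]
    return sum(cells) / len(cells)


temps = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = apply(temps, moving_avg)
print(smoothed)
```

The UDF only ever names relative offsets, so the runtime is free to decide how the array is chunked and where each stencil is centered.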

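
The trial run described on the ghost zone slide (execute the UDF on a special Stencil instance and collect the offsets it uses) can be sketched as follows. The class `RecordingStencil` and the function `ghost_zone_size` are hypothetical names, and taking the maximum absolute offset separately per dimension is an assumption of this sketch; the slides only state that the size is the maximum of the collected offsets.

```python
class RecordingStencil:
    """Hypothetical probe: records every offset the UDF asks for."""

    def __init__(self):
        self.offsets = []

    def __call__(self, *offset):
        self.offsets.append(offset)
        return 0.0  # dummy cell value; only the offsets matter here


def ghost_zone_size(udf, ndim):
    """Trial-run `udf` once and return the largest |offset| per dimension."""
    probe = RecordingStencil()
    udf(probe)
    return tuple(
        max((abs(o[d]) for o in probe.offsets), default=0)
        for d in range(ndim)
    )


def vc_x(u):
    # Same access pattern as VC_X on the vorticity slide.
    return u(0, 1) - u(0, -1)


def laplacian(s):
    # Five-point 2D Laplacian, used here only as a second access pattern.
    return s(-1, 0) + s(1, 0) + s(0, -1) + s(0, 1) - 4 * s(0, 0)


vc_x_ghost = ghost_zone_size(vc_x, 2)
laplacian_ghost = ghost_zone_size(laplacian, 2)
print(vc_x_ghost, laplacian_ghost)
```

Because the probe returns a dummy value, a UDF whose accessed offsets depend on the cell values may touch cells that a single trial run of this sketch does not observe.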

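
The ghost cell counts quoted on the chunking slide (12 for the square chunk, 20 for the non-square one) can be checked with a little arithmetic. One reading consistent with both numbers, offered here as an assumption since the figure itself is not available, is a 9-cell chunk with a one-cell ghost layer on each of its four faces and no corner cells: a 3 x 3 chunk versus a 1 x 9 chunk. The function name `face_ghost_cells` is hypothetical.

```python
def face_ghost_cells(rows, cols, width=1):
    """Count ghost cells on the four faces of a rows x cols chunk.

    Corner cells are excluded, which is an assumption of this sketch.
    """
    return 2 * width * (rows + cols)


square = face_ghost_cells(3, 3)  # 9-cell square chunk
strip = face_ghost_cells(1, 9)   # 9-cell single-row chunk
print(square, strip)  # prints: 12 20
```

For a fixed number of cells, the perimeter term `rows + cols` is smallest when the chunk is square, which is why square chunks minimize ghost cells when the layout is unknown.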