A Parallel Implementation of the Element-Free Galerkin Method on a Network of PCs



Issue Date: Apr-2000

Type: Thesis

Publisher: Asian Institute of Technology

Abstract: The element-free Galerkin method (EFGM) is a recently developed numerical technique for solving problems in a wide range of application areas including solid and fluid mechanics. The primary benefit of these methods is the elimination of the need for meshing (or remeshing) complex three-dimensional problem domains. With EFGM, the discrete model of the object is completely described by nodes and a description of the problem domain boundary. However, the elimination of meshing difficulties does not come freely since the EFGM is much more computationally expensive than the finite element method (FEM), especially for three-dimensional and non-linear applications. Parallel processing has long been an available technique to improve the performance of scientific computing programs, including the finite element method. With efficient programming, parallel processing can overcome the high computing time that is typically required in analyses employing EFGM or other meshless methods. This work focuses on the application of the concepts in parallel processing to EFGM analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations, and the solution for nodal unknowns, so that the time required for EFGM analyses is reduced. Several low-cost personal computers are joined together to form a parallel computer with the potential for raw computing power comparable to that of the fastest serial computers. The processors communicate via a local high-speed network using the Message Passing Interface (MPI), a standard library of functions that enables parallel programs to be executed on and communicate efficiently over a variety of machines. To provide a comparison between the parallelized and the serial versions of the EFGM computer program, several benchmark 3D structural mechanics problems are analyzed to show that the parallelized EFGM program can provide substantially shorter run time than the serial program without loss of solution accuracy.

URI: http://dspace.siu.ac.th/handle/1532/134


  1. A PARALLEL IMPLEMENTATION OF THE ELEMENT-FREE GALERKIN METHOD ON A NETWORK OF PCs
     by Thiti Vacharasintopchai
     A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering
     Examination Committee: Dr. William J. Barry (Chairman), Professor Worsak Kanok-Nukulchai, Dr. Pennung Warnitchai
     Nationality: Thai
     Previous Degree: Bachelor of Civil Engineering, Chulalongkorn University, Bangkok, Thailand
     Scholarship Donor: Asian Institute of Technology Partial Scholarship
     Asian Institute of Technology, School of Civil Engineering, Bangkok, Thailand, April 2000
  2. ACKNOWLEDGMENT
     I would like to express profound gratitude to Dr. William J. Barry, my advisor, who always gave invaluable guidance, inspirational suggestions, encouragement, and support all the way through this research. I would like to express sincere appreciation to Professor Worsak Kanok-Nukulchai and Dr. Pennung Warnitchai for serving as the examination committee members. I would also like to thank Dr. Putchong Uthayopas, a faculty member in the Department of Computer Engineering, Kasetsart University, who, through electronic correspondence, introduced me to the Beowulf parallel processing world. Access granted to the School of Civil Engineering high-performance computer workstations and financial support from the Asian Institute of Technology are gratefully acknowledged. In addition, I wish to thank my friends, especially Mr. Bunpot Nicrowanajamrat, Mr. Teera Tosukhowong, and Ms. Gallissara Agavatpanitch, for their generosity throughout my twenty-month residence at AIT. Friendliness is the most important factor that makes this institute a pleasant place to live in. Last but not least, I wish to dedicate this research work to my parents and family members, who, for better or worse, gave encouragement and support throughout my hardest times.
  3. ABSTRACT
     The element-free Galerkin method (EFGM) is a recently developed numerical technique for solving problems in a wide range of application areas including solid and fluid mechanics. The primary benefit of these methods is the elimination of the need for meshing (or remeshing) complex three-dimensional problem domains. With EFGM, the discrete model of the object is completely described by nodes and a description of the problem domain boundary. However, the elimination of meshing difficulties does not come freely, since the EFGM is much more computationally expensive than the finite element method (FEM), especially for three-dimensional and non-linear applications. Parallel processing has long been an available technique to improve the performance of scientific computing programs, including the finite element method. With efficient programming, parallel processing can overcome the high computing time that is typically required in analyses employing EFGM or other meshless methods. This work focuses on the application of the concepts of parallel processing to EFGM analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations, and the solution for nodal unknowns, so that the time required for EFGM analyses is reduced. Several low-cost personal computers are joined together to form a parallel computer with the potential for raw computing power comparable to that of the fastest serial computers. The processors communicate via a local high-speed network using the Message Passing Interface (MPI), a standard library of functions that enables parallel programs to be executed on, and to communicate efficiently over, a variety of machines. To provide a comparison between the parallelized and the serial versions of the EFGM computer program, several benchmark 3D structural mechanics problems are analyzed to show that the parallelized EFGM program can provide substantially shorter run time than the serial program without loss of solution accuracy.
  4. TABLE OF CONTENTS
     Chapter  Title  Page
     Title Page  i
     Acknowledgment  ii
     Abstract  iii
     Table of Contents  iv
     List of Figures  vi
     List of Tables  viii
     List of Appendices  ix
     1. Introduction  1
        1.1 Motivation  1
        1.2 Problem Statement  2
        1.3 Objectives  2
        1.4 Scope  3
        1.5 Research Approach  3
        1.6 Contributions  3
     2. Literature Review  4
        2.1 Element-free Galerkin Method (EFGM)  4
        2.2 Parallel Computing  15
        2.3 Applications of Parallel Processing in Computational Mechanics  18
        2.4 The NASA Beowulf Parallel Computer  19
     3. Building the Parallel Computing Infrastructures  21
        3.1 Hardware and Operating System Installation  21
        3.2 Software Configuration  25
     4. Development of the Parallel EFGM Software  28
        4.1 Design Consideration  28
        4.2 Fundamental Tools  30
        4.3 Implementation  36
     5. Numerical Results  53
        5.1 Linear Displacement Field Patch Test  53
        5.2 Cantilever Beam with End Load  59
        5.3 Pure Bending of Thick Circular Arch  64
        5.4 Extension of a Strip with One Circular Hole  68
        5.5 Overall Performance  73
  5. TABLE OF CONTENTS (cont'd)
     Chapter  Title  Page
     6. Conclusion and Recommendations  76
        6.1 Conclusion  76
        6.2 Recommendations  76
     References  78
     Appendix A  84
     Appendix B  87
     Appendix C  104
     Appendix D  116
  6. LIST OF FIGURES
     Figure  Title  Page
     3-1  The AIT Beowulf Hardware Configuration  22
     3-2  The AIT Beowulf NFS Configuration  27
     4-1  The Multicomputer Parallel Machine Model  28
     4-2  Compressed Row Storage (CRS)  32
     4-3  Symmetric Compressed Row Storage  33
     4-4  Illustration of the Qserv Concept  34
     4-5  Flowchart of Qserv  35
     4-6  Flowchart of ParEFG  39
     4-7  Flowchart of the ddefg_stiff Module  40
     4-8  Flowchart of the ddforce Module  43
     4-9  Flowcharts of the master_ddsolve and the worker_ddsolve Modules  46
     4-10  Flowcharts of the master_parallel_gauss and the worker_parallel_gauss Modules  47
     4-11  Row-wise Cyclic Striped Partitioning of Matrices  48
     4-12  A parallel_gauss Package  48
     4-13  Flowchart of the parallel_gauss Module  49
     5-1  Linear Displacement Field Patch Test  53
     5-2  Displacement in the x-direction along the Line y=1.50, z=1.50 for the Linear Displacement Field Patch Test  55
     5-3  Displacement in the y-direction along the Line y=1.50, z=1.50 for the Linear Displacement Field Patch Test  55
     5-4  Displacement in the z-direction along the Line y=1.50, z=1.50 for the Linear Displacement Field Patch Test  56
     5-5  Tensile Stress in the x-direction along the Line x=3.00, z=1.50 for the Linear Displacement Field Patch Test  58
     5-6  Average Speedups for the Linear Displacement Field Patch Test  58
     5-7  Average Efficiencies for the Linear Displacement Field Patch Test  58
     5-8  Cantilever Beam with End Load  59
     5-9  Vertical Displacement along the Neutral Axis (Line y=0.50, z=0.50) for a Cantilever Beam under a Concentrated Force  60
     5-10  Bending Stress Distribution along the Line x=6.00, z=0.50 for a Cantilever Beam under a Concentrated Force  61
     5-11  Average Speedups for a Cantilever Beam under a Concentrated Force  63
     5-12  Average Efficiencies for a Cantilever Beam under a Concentrated Force  63
     5-13  Pure Bending of Thick Circular Arch  64
  7. LIST OF FIGURES (cont'd)
     Figure  Title  Page
     5-14  Displacement in the x-direction along the Neutral Axis of a Thick Circular Arch under Pure Bending  65
     5-15  Tangential Stress Distribution through the Thickness of a Thick Circular Arch under Pure Bending  65
     5-16  Average Speedups for a Thick Circular Arch under Pure Bending  67
     5-17  Average Efficiencies for a Thick Circular Arch under Pure Bending  67
     5-18  Extension of a Strip with One Circular Hole  68
     5-19  Tensile Stress Distribution along the Line through the Center of the Hole and Perpendicular to the x-axis, for a Strip with One Circular Hole under Uniform Tension  70
     5-20  Tensile Stress Distribution along the Line through the Center of the Hole and Perpendicular to the y-axis, for a Strip with One Circular Hole under Uniform Tension  70
     5-21  Average Speedups for a Strip with One Circular Hole under Uniform Tension  72
     5-22  Average Efficiencies for a Strip with One Circular Hole under Uniform Tension  72
     5-23  Speedups of the Stiffness Computing Module under Various Numbers of Processes and Degrees of Freedom  73
     5-24  Speedups of the Parallel Equation Solver Module under Various Numbers of Processes and Degrees of Freedom  74
     5-25  Efficiencies of the Stiffness Computing Module under Various Numbers of Processes and Degrees of Freedom  74
     5-26  Efficiencies of the Parallel Equation Solver Module under Various Numbers of Processes and Degrees of Freedom  75
  8. LIST OF TABLES
     Table  Title  Page
     3-1  The Server Hardware Configuration  23
     3-2  The Workstation Hardware Configuration  24
     3-3  Networking Equipment  24
     3-4  Common Network Properties  25
     3-5  Node-specific Network Properties  25
     4-1  Partitioning of the Major Tasks in ParEFG  30
     4-2  Frequently Used Meschach Library Functions  30
     4-3  Frequently Used MPI Library Functions  32
     5-1  Average Run Times for the Linear Displacement Field Patch Test  57
     5-2  Average Speedups for the Linear Displacement Field Patch Test  57
     5-3  Average Efficiencies for the Linear Displacement Field Patch Test  57
     5-4  Average Run Times for a Cantilever Beam under a Concentrated Force  62
     5-5  Average Speedups for a Cantilever Beam under a Concentrated Force  62
     5-6  Average Efficiencies for a Cantilever Beam under a Concentrated Force  62
     5-7  Average Run Times for a Thick Circular Arch under Pure Bending  66
     5-8  Average Speedups for a Thick Circular Arch under Pure Bending  66
     5-9  Average Efficiencies for a Thick Circular Arch under Pure Bending  66
     5-10  Average Run Times for a Strip with One Circular Hole under Uniform Tension  71
     5-11  Average Speedups for a Strip with One Circular Hole under Uniform Tension  71
     5-12  Average Efficiencies for a Strip with One Circular Hole under Uniform Tension  71
  9. LIST OF APPENDICES
     Appendix  Title  Page
     A  Configuration Files  84
        A1  Common Network Configuration Files  84
        A2  The NFS Configuration Files  85
     B  Input Files  87
        B1  Linear Displacement Field Patch Test  87
        B2  Cantilever Beam with End Load  89
        B3  Pure Bending of Thick Circular Arch  92
        B4  Extension of a Strip with One Circular Hole  95
     C  Sample Output File  104
        C1  ParEFG Interpretation of the Input Data  104
        C2  Analysis Results  113
        C3  Analysis Logs  115
     D  Source Codes  116
        D1  The Queue Server  116
        D2  The Parallel EFGM Analysis Software  117
  10. CHAPTER 1  INTRODUCTION

     1.1 Motivation

     In performing the finite element analysis of structural components, meshing, which is the process of discretizing the problem domain into small sub-regions or elements with specific nodal connectivities, can be a tedious and time-consuming task. Although some relatively simple geometric configurations may be meshed automatically, complex geometric configurations often require manual preparation of the mesh. The element-free Galerkin method (EFGM), one of the recently developed meshless methods, avoids the need for meshing by employing a moving least-squares (MLS) approximation for the field quantities of interest (displacements in solid mechanics applications). With EFGM, the discrete model of the object is completely described by nodes and a description of the problem domain boundary. This is a particular advantage for problems such as the modeling of crack propagation with arbitrary and complex paths, the analysis of structures with moving interfaces, or the analysis of structural components undergoing large deformations. Since no remeshing is required for each step in the analysis, geometric discontinuities of the problem domain can be more easily handled.
     However, the advantage of avoiding the requirement of a mesh does not come cheap: EFGM is much more computationally expensive than the finite element method (FEM). The increased computational cost is especially evident for three-dimensional and non-linear applications of the EFGM, due to the use of MLS shape functions, which are formulated by a least-squares procedure at each integration point. This computational costliness is the predominant drawback of EFGM.
     Parallel processing has long been an available technique to improve the performance of scientific computing programs, including the finite element method.
     According to reference [14], the 'divide and conquer' paradigm, the concept of partitioning a large task into several smaller tasks, is frequently employed in parallel programming. These smaller tasks are then assigned to various computer processors. Kumar et al. [59] compared parallel processing to a master-workers relationship: the master divides a task into a set of subtasks assigned to multiple workers, and the workers then cooperate to accomplish the task in unison. With efficient programming, parallel processing can significantly reduce the high computing time that is typically required in EFGM analyses.
     A network of personal computers (PCs), which are typically much cheaper than workstation-class computers, is sufficient for the development of parallel processing algorithms [42]. From references [8,42,56,59,77] it can be inferred that, by connecting these low-cost PCs to form a parallel computer, it is possible to obtain raw computing power comparable to that of the fastest serial computers, such as the Cray vectorized supercomputers. The connection of such PCs can be accomplished by using the Message Passing Interface
  11. (MPI) [32], which is a message-passing¹ standard that enables parallel programs to be executed on, and to communicate efficiently over, a variety of machines.
     The focus of this research is to apply the concepts of parallel processing to EFGM analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations, and the solution for nodal unknowns, to minimize the time required for the EFGM analyses of structural components.

     1.2 Problem Statement

     The goal of this work is to design, implement, and test a parallel computer code for the analysis of structural components by the element-free Galerkin method. Past developments in parallelizing the finite element method and other meshless methods are studied, evaluated, and further developed within the EFGM framework, resulting in fast and efficient EFGM analyses. Benchmark problems in three-dimensional elasticity are analyzed to show that the parallelized EFGM computer code provides substantially shorter run times than the serial EFGM computer code, without discrepancy in the results. In the future, the resulting parallelized analysis tool may be extended to more complex problems in which EFGM offers distinct advantages over FEM, such as the aforementioned modeling of crack propagation, the analysis of structures with moving interfaces, and large deformation and large strain analysis of solids.

     1.3 Objectives

     The specific objectives of this research are as follows:
     1) To set up a parallel computer from a network of personal computers.
     2) To investigate past developments in parallelizing the finite element method and other meshless methods.
     3) To identify and evaluate several algorithmic alternatives for parallelizing the element-free Galerkin method.
     4) To develop and implement a parallel computer code to compute the EFGM stiffness matrix and the EFGM vector of equivalent nodal forces, assemble the system of equations, solve the system of equations, and post-process the solution data.
     5) To provide accuracy, run time, and speedup comparisons between the parallelized version of EFGM and the serial version, as applied to the aforementioned benchmark problems in 3D elasticity.

     ¹ A type of interaction method for parallel processors. See Section 2.2.2 for a detailed explanation.
  12. 1.4 Scope

     Since the problem of concern is to parallelize the EFGM code so that the run time of EFGM analyses is reduced, the development and implementation of the code is limited to three-dimensional problems, as these require significantly longer run times than two-dimensional problems, even without the use of complex geometrically or materially non-linear formulations. Thus, benchmark problems in three-dimensional elastostatics are considered in this work. Once an efficient linear code is achieved, it may be extended in the future to the analysis of non-linear problems, such as material plasticity and large strain analyses.

     1.5 Research Approach

     To achieve the above stated objectives, concepts from both structural engineering and computer science are applied. Because most parallel computing software libraries and tools have been developed under the UNIX operating system, the Linux operating system, a free implementation of UNIX on personal computers, was used in this research. Algorithms were developed and implemented into a computer program using the C programming language. An existing serial computer program for EFGM analysis [60,61], written in the C language, was studied and parallelized. This serial code was also used to analyze the benchmark problems, and the results are presented in Chapter 5 for the sake of comparison.
     The parallel program developed in this research employs the message-passing facilities of mpich [62], a portable public-domain implementation of the full MPI specification, developed at Argonne National Laboratory (ANL) in the United States for a wide variety of parallel computing environments.
     Since MPI is an industrial standard [32], using MPI for all message-passing communication provides portable building blocks for the implementation of large-scale EFGM application codes on a variety of more sophisticated parallel computers.

     1.6 Contributions

     This research addresses the computational costliness associated with EFGM, especially for three-dimensional applications, through the development and implementation of parallel algorithms and computer codes. The development of this work may be incorporated into more complex EFGM applications for accurate and efficient analysis of three-dimensional, non-linear mechanics of solids, with substantially reduced computational time.
  13. CHAPTER 2  LITERATURE REVIEW

     2.1 Element-free Galerkin Method (EFGM)

     2.1.1 General

     Meshless methods, numerical analysis techniques in which the discrete model of the structural component or object is described by only nodes and a description of the problem domain boundary, were first developed in the late 1970s. It was mentioned in references [53] and [49] that the first meshless method, called Smoothed Particle Hydrodynamics (SPH), was developed by Lucy [24] in 1977 for modeling astrophysical phenomena. Gingold and Monaghan [44] used this method for problems on infinite domains, i.e. with no boundaries, such as rotating stars and dust clouds. Libersky et al. [25] extended SPH to solid mechanics problems, but problems associated with instability of the solutions were reported [21].
     In 1992, Nayroles et al. [3] applied a least-squares technique in conjunction with the Galerkin method for solving 2D problems in solid mechanics and heat conduction. They called this the diffuse element method (DEM). A basis function and a weight function were used to form a smooth approximation based on a set of nodes with no explicit elements. The basic idea was to replace the FEM interpolation by a local, weighted least-squares fitting valid within a small neighborhood surrounding each nodal point. They suggested, with great insight, that adding and removing nodes, or locally modifying the distribution of nodes, was much easier than completely rebuilding FEM meshes.
     Belytschko et al. [51] showed that the approximation used in the work of Nayroles et al. [3] was in fact the moving least-squares (MLS) approximation described by Lancaster et al. in reference [40]. Since MLS approximation functions are not interpolating functions, the essential boundary conditions could not be directly satisfied by the Galerkin method. They refined the DEM by implementing a higher order of Gaussian quadrature, adding certain terms in the shape function derivatives that had been formerly omitted by Nayroles et al.
     [3], and employing Lagrange multipliers to enforce essential boundary conditions. The result was a new Galerkin method that utilized moving least-squares approximants and was called the element-free Galerkin method (EFGM). The method has proven very effective for solving a wide range of problems in 2D and 3D solid mechanics, such as static fracture mechanics and crack propagation [34,38,69].
     It was cited in reference [53] that, in addition to the works of Nayroles et al. [3] and Belytschko et al. [51], several other meshless methods have been developed, namely the reproducing kernel particle method (RKPM) [64], the hp-clouds method [4], the partition of unity finite element method (PUFEM) [20], the particle-in-cell (PIC) method [9], the generalized finite difference method [55], and the finite point method [10]. A comprehensive review of meshless methods can be found in reference [49].
  14. 2.1.2 Development of the EFGM

     Since its debut in 1994, the benefits of EFGM have been demonstrated in many fields. A large volume of research has contributed to EFGM development in a relatively short period of time.
     As a pioneering work, Belytschko et al. [51] applied the EFGM to two-dimensional elastostatics, static fracture mechanics, and steady-state heat conduction problems. It was shown that EFGM does not exhibit any volumetric locking even when linear basis functions are used, that the rate of convergence of the method may significantly exceed that of the finite element method, and that a high resolution of localized steep gradients can be achieved. They suggested that since element connectivities were not needed and the accuracy was not significantly affected by irregular nodal arrangements, progressively growing cracks could be easily modeled. It was noted that the use of Lagrange multipliers complicated the solution process, and it was pointed out that the problem could be remedied by the use of perturbed Lagrangian or other penalty methods.
     Because the MLS approximation function does not produce exact field values at nodal points, the imposition of essential boundary conditions is a major problem in MLS-based meshless methods such as EFGM. As a result, a great deal of research in the area of EFGM has focused on finding better techniques for the imposition of such boundary conditions.
     Lu et al. [73], realizing that the use of Lagrange multipliers increases the cost of solving the linear algebraic equations in EFGM, developed a new implementation of EFGM by replacing the Lagrange multipliers at the outset by their physical meaning, resulting in a banded and positive-definite discrete system of equations.
     Orthogonal MLS approximants were also constructed to eliminate the need for matrix inversion at each quadrature point. They solved two-dimensional elastostatic and static fracture mechanics problems with their new method and compared the results to those from the original EFGM with Lagrange multipliers. The comparison showed that, although higher efficiency was achieved, the accuracy of the new method was inferior to that of the original method.
     Krongauz and Belytschko [67] proposed a method to impose the essential boundary conditions in EFGM using finite elements. They employed a strip of finite elements along the essential boundaries, and the shape functions from the edge finite elements were combined with the MLS shape functions employed in EFGM analyses. With this technique, the essential boundary conditions could be imposed directly, as with finite elements. They claimed that, from numerical studies of elastostatic problems, the high convergence rate associated with MLS approximation was still retained. However, this is not always true: the high convergence rate was achieved because only small strips of finite elements were used, so EFGM errors still dominated the numerical solutions. If the finite elements had been used so extensively that their errors dominated, a lower convergence rate, consistent with that of the FEM, would have been obtained.
     Belytschko et al. [48] referred to the technique in the previous paragraph as the coupling of FEM and EFGM. In contrast to the work by Krongauz and Belytschko, in which finite elements were used in a small fraction of the problem domain, it was recommended that EFGM be used in only relatively small regions of the problem domain where it was most beneficial, such as near crack tips or other locations of singularity. It was noted in reference [53] that EFGM could provide an excellent complement to the FEM in situations where finite elements were not effective. Hegen [7] also proposed the same idea about the coupling of
  15. FEM and EFGM. Both Belytschko et al. [48] and Hegen [7] suggested that, since EFGM requires considerably more computer resources, limiting EFGM modeling to the needed areas could save significant computational time. The coupling of FEM and EFGM may therefore be viewed both as a technique to impose the essential boundary conditions and as a technique to speed up the computation. It was noted in reference [38] that EFGM could be coupled seamlessly with parallel versions of finite element programs, making the analysis run even faster. Belytschko et al. [48] used FEM-EFGM coupling to solve two-dimensional problems in elastostatics, elastodynamics, and dynamic fracture mechanics, as well as a one-dimensional wave propagation problem, while Hegen [7] used the same technique to solve two-dimensional elastostatic and static fracture mechanics problems. The high efficiency inherited from FEM was obtained as expected. However, the high convergence rate of EFGM was lost because the FEM error dominated the numerical solutions.
     Mukherjee et al. [68] developed an alternative strategy for imposing the essential boundary conditions. They proposed a new definition of the discrete norm that is typically minimized to obtain the coefficients in MLS approximations². It was reported that their strategy worked well for 2D EFGM problems. Zhu et al. [57] presented a modified collocation method and a penalty formulation to enforce the essential boundary conditions in EFGM, as an alternative to the method of Lagrange multipliers. It was reported that their formulation gave a symmetric positive-definite system of equations while the absence of volumetric locking and a high convergence rate were retained.
     Kaljevic and Saigal [15] applied a technique that employed singular weight functions in the formulation of MLS approximations. With the singular weight function, the approximants passed through their respective nodes and therefore the essential boundary conditions could be explicitly satisfied.
     With the use of singular weight functions, the MLS approximants could then be termed interpolants. This technique resulted in a reduced number of positive-definite and banded discrete equations. Two-dimensional elastostatic and static fracture mechanics problems were solved. They reported that both higher efficiency and higher accuracy, as compared to the previous implementations of EFGM, were achieved.
     In addition to the development of techniques for the imposition of essential boundary conditions, there have been a number of works to improve other aspects of EFGM. A representative selection of these works is described in the following paragraphs.
     Belytschko et al. [50] developed a new procedure for computing shape functions for the EFGM that preserves the continuity of functions in domains with concave boundaries. The procedure was applied to elastostatic and static fracture problems. Overall accuracy was improved while the convergence rate was unchanged. A new method for the calculation of MLS approximants and their derivatives was also devised. It was reported that this method gave a substantial decrease in computational time as compared to the previous formulations.
     Beissel and Belytschko [46] explored the possibility of evaluating the integrals of the weak form only at the nodes. Nodal integration would make EFGM truly element-free; that is, the need for background integration cells would be eliminated. It was shown that their nodal integration scheme suffered from spurious singular modes resulting from under-integration of the weak form. A technique for treating this instability was developed and tested for

     ² See equation (2.2) on page 8.
  16. elastostatic problems. Good numerical results were achieved after this treatment. However, it was noted in reference [53] that the accuracy obtained was inferior to that of the background integration cell method.
     Kaljevic and Saigal [15] developed a numerical integration scheme which employed the concept of dividing the rectangular integration cells that partially belong to the problem domain into sub-cells that completely belong to the domain. With this technique, the automatic and accurate treatment of two-dimensional domains with arbitrary geometric configurations was made possible.
     Krysl and Belytschko [37] examined the construction of the shape functions in EFGM and discussed the implications that the choice of shape functions has on the convergence rates. It was shown that, for non-convex domain boundaries, it is possible to construct and use discontinuous weight functions that lead to discontinuous shape functions. The convergence rate of the variant of the EFGM that used such a construction of shape functions was not affected by the discontinuities when linear shape functions were used.
     Häussler-Combe and Korn [58] presented a scheme for automatic, adaptive EFGM analysis. Based on interpolation error estimation and geometric subdivision of the problem domain into integration cells, they developed an a posteriori adaptive strategy to move, discard, or introduce nodes in the nodal discretization of the problem domain. Dense nodal arrangements were generated automatically in sub-domains where high accuracy was needed. The technique showed good results for elastic, elastoplastic, and static fracture mechanics problems.
     Belytschko and Fleming [53] compared methods for smoothing the approximations near non-convex boundaries, such as cracks, and techniques for enriching the EFGM approximations near the tip of linear elastic cracks. They introduced a penalty method-based contact algorithm for enforcing crack contact in overall compressive fields.
To illustrate the new technique, crack propagation under compressive loading with crack surface contact was simulated. It was found that their numerical results closely matched experimental results.

Dolbow and Belytschko [19] investigated the numerical integration of Galerkin weak forms for meshless methods using EFGM as a case study. They pointed out that the construction of quadrature cells without consideration of the local supports of the shape functions could result in a considerable amount of integration error, and presented a technique to construct integration cells that align with the shape function supports to improve the Gauss quadrature accuracy.

Meshless methods remain a very active area of research, as evidenced by the numerous articles that appear each month in the top international journals in the field of computational mechanics. As meshless methods mature and are applied to increasingly challenging problems, interest in parallel algorithms for structural analysis with meshless methods is almost certain to increase significantly.

2.1.3 Formulation of the EFGM

The original formulation of EFGM by Belytschko et al. [51], in which Lagrange multipliers were used to impose the essential boundary conditions, is presented in this section.
The MLS approximation is first introduced. Then, various types of weight functions are presented. Finally, the formulation of discrete equations, for application to 3D elastostatics, is described. The formulation of the improved versions of the EFGM can be found in the respective previously cited references.

MLS approximation

In the moving least-squares approximation, we let the approximation of the function u(x) be written as

    u^h(\mathbf{x}) = \sum_{j=1}^{m} p_j(\mathbf{x})\, a_j(\mathbf{x}) \equiv \mathbf{p}^T(\mathbf{x})\, \mathbf{a}(\mathbf{x})    (2.1)

where m is the number of terms in the basis, p_j(x) are monomial basis functions, and a_j(x) are their as yet undetermined coefficients. Note that both p_j(x) and a_j(x) are functions of the spatial coordinates x. Examples of commonly used linear and quadratic bases are as follows:

Linear basis:
    One dimension:    p^T = [1, x]
    Two dimensions:   p^T = [1, x, y]
    Three dimensions: p^T = [1, x, y, z]

Quadratic basis:
    One dimension:    p^T = [1, x, x^2]
    Two dimensions:   p^T = [1, x, y, xy, x^2, y^2]
    Three dimensions: p^T = [1, x, y, z, xy, xz, yz, x^2, y^2, z^2]

The coefficients a_j(x) in equation (2.1) are obtained by minimizing a weighted, discrete L2 norm as follows:

    J = \sum_{I=1}^{n} w(\mathbf{x} - \mathbf{x}_I) \left[ \mathbf{p}^T(\mathbf{x}_I)\, \mathbf{a}(\mathbf{x}) - u_I \right]^2    (2.2)

where u_I is the nodal value of u at x = x_I, and n is the number of nodal points that are visible from x, i.e. for which the weight functions w(x − x_I) are non-zero. The regions surrounding nodal points in which the weight functions are non-zero are termed the domains of influence of their respective nodal points.

The stationarity of J in equation (2.2) with respect to a(x) leads to the following linear relation between a(x) and u_I:

    A(\mathbf{x})\, \mathbf{a}(\mathbf{x}) = B(\mathbf{x})\, \mathbf{u}    (2.3)
or

    \mathbf{a}(\mathbf{x}) = A^{-1}(\mathbf{x})\, B(\mathbf{x})\, \mathbf{u}    (2.4)

where A(x) and B(x) are the matrices defined by

    A(\mathbf{x}) = \sum_{I=1}^{n} w_I(\mathbf{x})\, \mathbf{p}(\mathbf{x}_I)\, \mathbf{p}^T(\mathbf{x}_I), \qquad w_I(\mathbf{x}) \equiv w(\mathbf{x} - \mathbf{x}_I)    (2.5)

    B(\mathbf{x}) = \left[ w_1(\mathbf{x})\, \mathbf{p}(\mathbf{x}_1),\; w_2(\mathbf{x})\, \mathbf{p}(\mathbf{x}_2),\; \ldots,\; w_n(\mathbf{x})\, \mathbf{p}(\mathbf{x}_n) \right]    (2.6)

    \mathbf{u}^T = \left[ u_1, u_2, \ldots, u_n \right].    (2.7)

Therefore, we have

    u^h(\mathbf{x}) = \sum_{I=1}^{n} \sum_{j=1}^{m} p_j(\mathbf{x}) \left( A^{-1}(\mathbf{x})\, B(\mathbf{x}) \right)_{jI} u_I \equiv \sum_{I=1}^{n} \phi_I(\mathbf{x})\, u_I    (2.8)

where the shape function \phi_I(x) is defined by

    \phi_I(\mathbf{x}) = \sum_{j=1}^{m} p_j(\mathbf{x}) \left( A^{-1}(\mathbf{x})\, B(\mathbf{x}) \right)_{jI}.    (2.9)

The partial derivatives of \phi_I(x) can be obtained as follows:

    \phi_{I,i}(\mathbf{x}) = \sum_{j=1}^{m} \left\{ p_{j,i} \left( A^{-1} B \right)_{jI} + p_j \left( A^{-1}_{,i} B + A^{-1} B_{,i} \right)_{jI} \right\}    (2.10)

where

    A^{-1}_{,i} = -A^{-1} A_{,i} A^{-1}    (2.11)

and the index following a comma indicates a spatial derivative.

Weight functions

The weight functions w_I(x) ≡ w(x − x_I) play an important role in the performance of the EFGM. They should be constructed so that they are positive and so that a unique solution a(x) of equation (2.3) is guaranteed. Also, they should decrease in magnitude as the distance from x to x_I increases.
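To make equations (2.5)–(2.9) concrete, the following sketch evaluates the MLS shape functions at a single point for a one-dimensional linear basis p = [1, x], using the quartic spline weight presented in the next subsection. The node layout, evaluation point, and support size dmax are illustrative assumptions, not values taken from this thesis.

```python
import numpy as np

def quartic_spline(d):
    # Quartic spline weight, equation (2.14): zero for d > 1.
    return np.where(d <= 1.0, 1 - 6*d**2 + 8*d**3 - 3*d**4, 0.0)

def mls_shape_functions(x, nodes, dmax):
    """Evaluate the MLS shape functions phi_I(x) for a 1D linear basis."""
    d = np.abs(x - nodes) / dmax                   # normalized distances d_I / dmax
    w = quartic_spline(d)                          # weights w_I(x)
    P = np.vstack([np.ones_like(nodes), nodes]).T  # row I holds p(x_I) = [1, x_I]
    A = (w[:, None] * P).T @ P                     # A(x) = sum_I w_I p(x_I) p^T(x_I), eq (2.5)
    B = P.T * w                                    # columns w_I p(x_I), eq (2.6)
    p = np.array([1.0, x])                         # basis evaluated at x
    return p @ np.linalg.solve(A, B)               # phi(x) = p^T A^{-1} B, eq (2.9)

nodes = np.linspace(0.0, 1.0, 5)                   # five equally spaced nodes (assumed)
phi = mls_shape_functions(0.37, nodes, dmax=0.6)
print(phi.sum())                                   # should sum to 1 (partition of unity)
```

Because the linear basis contains the constant and linear monomials, the resulting shape functions form a partition of unity and reproduce linear fields exactly, which is a convenient correctness check for any MLS implementation.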
Let d_max be the size of the support of the weight function, d_I = ||x − x_I||, and d = d_I / d_max. Commonly used MLS weight functions [49] are presented as follows:

Exponential:
    w(d) = \begin{cases} e^{-(d/\alpha)^2} & \text{for } d \le 1 \\ 0 & \text{for } d > 1 \end{cases}    (2.12)

Cubic spline:
    w(d) = \begin{cases} \frac{2}{3} - 4d^2 + 4d^3 & \text{for } d \le \frac{1}{2} \\ \frac{4}{3} - 4d + 4d^2 - \frac{4}{3} d^3 & \text{for } \frac{1}{2} < d \le 1 \\ 0 & \text{for } d > 1 \end{cases}    (2.13)

Quartic spline:
    w(d) = \begin{cases} 1 - 6d^2 + 8d^3 - 3d^4 & \text{for } d \le 1 \\ 0 & \text{for } d > 1 \end{cases}    (2.14)

Singular:
    w(d) = \begin{cases} \frac{1}{d^2} \left( 1 - \frac{1}{d} \right)^2 & \text{for } d \le 1 \\ 0 & \text{for } d > 1. \end{cases}    (2.15)

According to reference [49], the exponential weight function is actually C^{-1} continuous since it is not equal to zero at d = 1, but for numerical purposes it resembles a weight function with C^1 continuity or higher; in the exponent, the parameter α = 0.4 results in w(1) ≅ 0.002. The cubic and quartic spline weight functions, constructed to possess C^2 continuity, are more favorable than the exponential weight function because they provide better continuity and are computationally less demanding. The singular weight function allows the direct imposition of essential boundary conditions, thus eliminating the need for Lagrange multipliers. Kaljevic and Saigal noted in reference [15] that singularities will not complicate EFGM problems since they can be removed through algebraic manipulations.

Formulation of discrete equations

An EFGM formulation for 3D elastostatics, starting from the variational form and employing Lagrange multipliers, is now presented. Consider the following three-dimensional problem on the domain Ω bounded by Γ:

    \nabla \cdot \boldsymbol{\sigma} + \mathbf{b} = \mathbf{0} \quad \text{in } \Omega    (2.16)

where σ is the Cauchy stress tensor, which corresponds to the displacement field u, and b is the body force vector. The boundary conditions are given as follows:
    \boldsymbol{\sigma} \cdot \mathbf{n} = \bar{\mathbf{t}} \quad \text{on } \Gamma_t    (2.17)

    \mathbf{u} = \bar{\mathbf{u}} \quad \text{on } \Gamma_u    (2.18)

in which the superposed bar denotes prescribed boundary values, and n is the unit normal to the domain boundary Γ.

The variational or weak form of the equilibrium equation is posed as follows. Consider trial functions u(x) ∈ H^1 and Lagrange multipliers λ ∈ H^0, and test functions δv(x) ∈ H^1 and δλ ∈ H^0. Then if

    \int_\Omega \delta(\nabla_s \mathbf{v})^T : \boldsymbol{\sigma}\, d\Omega
    - \int_\Omega \delta\mathbf{v}^T \cdot \mathbf{b}\, d\Omega
    - \int_{\Gamma_t} \delta\mathbf{v}^T \cdot \bar{\mathbf{t}}\, d\Gamma
    - \int_{\Gamma_u} \delta\boldsymbol{\lambda}^T \cdot (\mathbf{u} - \bar{\mathbf{u}})\, d\Gamma
    - \int_{\Gamma_u} \delta\mathbf{v}^T \cdot \boldsymbol{\lambda}\, d\Gamma = 0
    \quad \forall\, \delta\mathbf{v} \in H^1,\ \delta\boldsymbol{\lambda} \in H^0    (2.19)

the equilibrium equation (2.16) and the boundary conditions (2.17) and (2.18) are satisfied. Here ∇_s v^T is the symmetric part of ∇v^T, and H^1 and H^0 denote the Sobolev spaces of degree one and zero, respectively. Note that the trial functions, computed using MLS approximation functions, do not satisfy the essential boundary conditions, and therefore the use of Lagrange multipliers is necessitated.

In order to obtain the discrete system of equations from the weak form (2.19), the approximate solution u and the test function δv are constructed according to equation (2.8). The Lagrange multiplier λ is written as

    \boldsymbol{\lambda}(\mathbf{x}) = N_I(s)\, \boldsymbol{\lambda}_I, \quad \mathbf{x} \in \Gamma_u    (2.20)

    \delta\boldsymbol{\lambda}(\mathbf{x}) = N_I(s)\, \delta\boldsymbol{\lambda}_I, \quad \mathbf{x} \in \Gamma_u    (2.21)

where N_I(s) is a Lagrange interpolant and s is the arc length along the problem domain boundary; the repeated indices designate summations. The final discrete equations can be obtained by substituting the trial functions, test functions, and equations (2.20) and (2.21) into the weak form (2.19), which yields

    \begin{bmatrix} K & G \\ G^T & 0 \end{bmatrix} \begin{Bmatrix} \mathbf{u} \\ \boldsymbol{\lambda} \end{Bmatrix} = \begin{Bmatrix} \mathbf{f} \\ \mathbf{q} \end{Bmatrix}    (2.22)

where

    K_{IJ} = \int_\Omega B_I^T D\, B_J\, d\Omega,    (2.23a)
    G_{IK} = -\int_{\Gamma_u} \phi_I N_K\, d\Gamma,    (2.23b)

    \mathbf{f}_I = \int_{\Gamma_t} \phi_I \bar{\mathbf{t}}\, d\Gamma + \int_\Omega \phi_I \mathbf{b}\, d\Omega,    (2.23c)

and

    \mathbf{q}_K = -\int_{\Gamma_u} N_K \bar{\mathbf{u}}\, d\Gamma    (2.23d)

where

    B_I = \begin{bmatrix}
    \phi_{I,x} & 0 & 0 \\
    0 & \phi_{I,y} & 0 \\
    0 & 0 & \phi_{I,z} \\
    \phi_{I,y} & \phi_{I,x} & 0 \\
    0 & \phi_{I,z} & \phi_{I,y} \\
    \phi_{I,z} & 0 & \phi_{I,x}
    \end{bmatrix},    (2.24a)

    N_K = \begin{bmatrix} N_K & 0 & 0 \\ 0 & N_K & 0 \\ 0 & 0 & N_K \end{bmatrix},    (2.24b)

    D = \begin{bmatrix}
    (1-\nu)c & \nu c & \nu c & 0 & 0 & 0 \\
    \nu c & (1-\nu)c & \nu c & 0 & 0 & 0 \\
    \nu c & \nu c & (1-\nu)c & 0 & 0 & 0 \\
    0 & 0 & 0 & G & 0 & 0 \\
    0 & 0 & 0 & 0 & G & 0 \\
    0 & 0 & 0 & 0 & 0 & G
    \end{bmatrix},

and

    c = \frac{E}{(1+\nu)(1-2\nu)}; \qquad G = \frac{E}{2(1+\nu)} \quad \text{for isotropic materials.}    (2.24c)

In the above expressions, E and ν are Young's modulus and Poisson's ratio, respectively.
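As a sketch of how the arrays in equations (2.24a)–(2.24c) translate into code, the following builds the isotropic constitutive matrix D and a single nodal matrix B_I, then forms the product B_I^T D B_J (with J = I) that appears in the integrand of equation (2.23a). The material constants and the shape-function derivative values are illustrative assumptions.

```python
import numpy as np

def constitutive_matrix(E, nu):
    """Isotropic 3D elasticity matrix D of equation (2.24c)."""
    c = E / ((1 + nu) * (1 - 2 * nu))
    G = E / (2 * (1 + nu))
    D = np.zeros((6, 6))
    D[:3, :3] = nu * c                       # off-diagonal normal terms
    np.fill_diagonal(D[:3, :3], (1 - nu) * c)  # diagonal normal terms
    D[3:, 3:] = np.eye(3) * G                # shear terms
    return D

def b_matrix(dphi):
    """Nodal strain-displacement matrix B_I, eq (2.24a), from [phi_x, phi_y, phi_z]."""
    px, py, pz = dphi
    return np.array([[px, 0,  0],
                     [0,  py, 0],
                     [0,  0,  pz],
                     [py, px, 0],
                     [0,  pz, py],
                     [pz, 0,  px]])

D = constitutive_matrix(E=200e9, nu=0.3)     # illustrative steel-like values
B = b_matrix([0.1, -0.2, 0.05])              # illustrative derivative values
k = B.T @ D @ B                              # one 3x3 nodal block of the K_IJ integrand
print(np.allclose(D, D.T), np.allclose(k, k.T))  # both True: D and B^T D B are symmetric
```

The symmetry of D, and hence of each B_I^T D B_J block with J = I, is a quick sanity check on the assembly routine before the quadrature loop over the background cells is written.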
2.1.4 Applications of the EFGM

In addition to the test problems mentioned in Section 2.1.2, EFGM has been applied by many researchers to several different classes of problems.

Belytschko et al. [52] applied EFGM to two-dimensional static and dynamic fracture mechanics. They demonstrated the capability of EFGM to model complex problems involving the evolution of growing cracks. In both static and dynamic cases a growing crack could be modeled simply by extending the free boundaries associated with the crack. Lu et al. [72] used EFGM to solve one-dimensional wave propagation and two-dimensional dynamic fracture problems. They developed a weak form of the kinematic boundary condition in order to enforce it. It was shown that accurate mode I and mode II stress intensity factors could be computed by EFGM.

Krysl and Belytschko used EFGM in the analysis of arbitrary Kirchhoff plates [35] and shells [36]. The C^1 continuity requirement was easily met since EFGM required only C^1 weight functions; therefore the Mindlin-Reissner theory or the discrete Kirchhoff theory was not necessary. High accuracy was achieved for arbitrary grid geometries in clamped and simply supported plates. Membrane locking for the shell cases was alleviated by enlarging the domains of influence of the nodes for the quadratic basis and was completely eliminated by using the quartic polynomial basis.

The application of EFGM to solid mechanics problems containing material inhomogeneities was presented by Cordes and Moran [26]. Very accurate displacement results were reported, and a set of filtering schemes was introduced to improve the numerical solution in the stress and strain fields.

Additional problems in two-dimensional dynamic fracture mechanics were solved with EFGM by Belytschko and Tabbara [54]. They suggested that the method had the potential to accurately predict almost arbitrary crack paths, and could be easily extended to anisotropic and non-linear problems.
The qualitative behavior of their numerical results agreed well with the experimental results. Sukumar et al. [34] applied a coupled FEM-EFGM to problems in three-dimensional fracture mechanics. Domain integral methods were used to evaluate the stress intensity factors along a three-dimensional crack front. Fleming et al. [29] introduced an enriched EFGM formulation for fracture problems. It was shown that the new formulation greatly reduced the numerical stress oscillations near the crack tips and yielded accurate stress intensity factors with significantly fewer degrees of freedom.

The EFGM analysis of stable crack growth in an elastic solid was pioneered by Xu and Saigal [70]. In their formulation, the inertia force term in the momentum equation was converted to a spatial gradient term by employing the steady state conditions. A convective domain was employed to account for the analysis domain moving at the same speed as the crack front. Good agreement of the numerical results with the analytical solutions was reported. They noted that their formulation was a promising tool for the analysis of stable crack growth problems in both elastic-plastic and elastic-viscoplastic solids. In reference [71], Xu and Saigal extended their work to the analysis of steady quasi-static crack growth under plane strain conditions in elastic-perfectly plastic materials. Numerical studies showed very good agreement with the corresponding asymptotic solutions. Recently the EFGM analysis of steady dynamic crack growth in elastic-plastic materials was carried out by the same researchers
[69]. They considered both rate-independent and rate-dependent materials. Numerical results also agreed well with the analytical solutions.

The application of EFGM to acoustic wave propagation was investigated by Bouillard and Suleau [43]. They implemented an EFGM for analyzing the harmonic forced response in acoustic problems. It was reported that, compared to FEM, EFGM gave better control of the dispersion and pollution errors that are the specific problems associated with acoustic numerical analysis.

The most advanced and efficient application of EFGM seems to be the recent work by Krysl and Belytschko [38]. The coupling of FEM-EFGM was used to model arbitrary three-dimensional propagating cracks in elastic bodies. An EFG super-element was developed and embedded in an explicit finite element system. The super-element was used in the region of the problem domain through which cracks could potentially propagate. Complex simulations, such as the mixed-mode growth of a center through-crack in a finite plate, the mode-I surface-breaking penny-shaped crack in a cube, the penny-shaped crack growing under general mixed-mode conditions, and the torsion-tension rectangular bar with a center through-crack, were successfully analyzed.

2.1.5 Computational cost and efficiency of the EFGM

One of the major disadvantages of the EFGM is the increased computational cost when compared to the FEM. Belytschko et al. stated in reference [50] that the additional computational load of EFGM comes from several sources, listed as follows:

1. The need to identify the nodes in the domain of influence for all points at which the approximating function is calculated;
2. The relative complexity of the shape functions, which increases the cost of evaluating them and their derivatives;
3. The additional expense of dealing with essential boundary conditions.

To date, published works have dealt with the second and the third items in the above list.
From Section 2.1.2, the work by Lu et al. [73] simplified the computation of MLS shape functions and the treatment of essential boundary conditions. The work by Belytschko et al. [50] simplified the calculation of MLS shape functions and their derivatives. In addition, the work by Belytschko et al. [48,67] and Hegen [7] that dealt with the coupling of FEM-EFGM simplified the treatment of essential boundary conditions. Belytschko et al. even suggested in reference [48] that EFGM be used only as required for higher accuracy in a problem domain consisting primarily of finite elements, so that the computational costliness of the EFGM could be avoided. This was reflected in their recent work on three-dimensional crack propagation simulation, in which EFGM was coupled with the parallel version of an FEM program [38], as discussed in Section 2.1.4. It should be noted that the coupling of FEM-EFGM as suggested by Belytschko et al. [48] significantly reduces the high convergence rate associated with the original versions of EFGM. The development of a truly parallel EFGM code resulting in fast EFGM solutions with a high convergence rate has not been found in the available literature.
2.2 Parallel Computing

2.2.1 General

Many of today's complex problems in physics, chemistry, biology, meteorology, and engineering require computational speeds well beyond the limits attainable by sequential machines in order to obtain real-time solutions. Carter et al. stated in reference [65] that the trends in the development of electronic machines for large scientific computations are pointing toward parallel computer architectures as an answer to increasing user demand. According to reference [2], parallel processing is a method of using many small tasks to solve one large problem. It was cited in reference [14] that the 'divide and conquer' paradigm, which is the concept of partitioning a large task into several smaller tasks assigned to various computer processors, is frequently employed in parallel programming. Kumar et al. [59] compared parallel processing to a master-workers relationship in which the master divides a task into a set of subtasks assigned to multiple workers, who cooperate and accomplish the task in unison. Parallel computing is likely to be the most important tool for sophisticated problems in the near future.

2.2.2 Taxonomy of parallel architectures

Parallel computers can be constructed in many ways. In this section, the taxonomy of parallel architectures taken from reference [59] will be described.

Control mechanism

Parallel computers may be classified by their control mechanism as single instruction stream, multiple data stream (SIMD) or multiple instruction stream, multiple data stream (MIMD). In SIMD parallel computers, all processing units execute the same instruction synchronously, whereas in MIMD parallel computers each processor is capable of executing instructions independently of the other processors. SIMD computers require less hardware than MIMD computers because they have only one global control unit. They also require less memory because only one copy of the program needs to be stored.
In contrast, MIMD computers store the program and operating system on each processor. SIMD computers are naturally suited to data-parallel programs, that is, programs in which the same set of instructions is executed on a large data set. Moreover, SIMD computers require less startup time for communicating with neighboring processors. However, a drawback and limitation of SIMD computers is that different processors cannot execute different instructions in the same clock cycle.

Interaction method

Parallel computers are also classified, by the method of interaction among processors, into the message-passing and the shared-address-space architectures. In a message-passing architecture, processors are connected using a message-passing interconnection network. Each processor has its own memory, called the local or private memory, which is accessible only to
that processor. Processors can interact only by passing messages to each other. This architecture is also referred to as a distributed-memory or private-memory architecture. Message-passing MIMD computers are commonly referred to as multicomputers. The shared-address-space architecture, on the other hand, provides hardware support for read and write access by all processors to a shared address space or memory. Processors interact by modifying data objects stored in the shared address space. Shared-address-space MIMD computers are often referred to as multiprocessors.

It is easy to emulate a message-passing architecture containing p processors on a shared-address-space computer with an identical number of processors. This is done by partitioning the shared address space into p disjoint parts and assigning one such partition exclusively to each processor. A processor sends a message to another processor by writing into the other processor's partition of memory. However, emulating a shared-address-space architecture on a message-passing computer is costly, since accessing another processor's memory requires sending and receiving messages. Therefore, shared-address-space computers provide greater flexibility in programming. Moreover, some problems require rapid access by all processors to large data structures that may be changing dynamically. Such access is better supported by shared-address-space architectures. Nevertheless, the hardware needed to provide a shared address space tends to be more expensive than that for message passing. As a result, according to references [2,8,32], message passing is the most widely used interaction method for parallel computers.

Interconnection networks

Shared-address-space computers and message-passing computers can be constructed by connecting processors and memory units using a variety of interconnection networks, which can be classified as static or dynamic.
Static networks consist of point-to-point communication links among processors and are also referred to as direct networks. Static networks are typically used to construct message-passing computers. On the other hand, dynamic networks are built using switches and communication links. Communication links are connected to one another dynamically by the switching elements to establish paths among processors and memory banks. Dynamic networks are referred to as indirect networks and are normally used to construct shared-address-space computers.

Processor granularity

A parallel computer may be composed of a small number of very powerful processors or a large number of relatively less powerful processors. Computers that belong to the former class are called coarse-grained computers, while those belonging to the latter are called fine-grained computers. Computers situated between these two classes are medium-grained computers.

Different applications are suited to coarse-, medium-, or fine-grained computers to varying degrees. Many applications have only a limited amount of concurrency3. Such applications cannot make effective use of a large number of less powerful processors, and are best suited to coarse-grained computers. Fine-grained computers, however, are more cost effective for applications with a high degree of concurrency.

3 The apparently simultaneous execution of two or more routines or programs [33].
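The master-workers, divide-and-conquer paradigm described in Section 2.2.1 can be sketched in a few lines: the master partitions the data into one chunk per worker (the chunk length playing the role of the grain size), the workers compute partial results concurrently, and the master combines them. The thread pool and the sum-of-squares task below are illustrative assumptions; a Beowulf-style implementation would instead distribute the chunks to separate processors via message passing.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Each worker handles one subtask of the partitioned problem.
    return sum(x * x for x in chunk)

def master(data, workers=4):
    """Partition the task into one chunk per worker, then combine the results."""
    size = (len(data) + workers - 1) // workers          # grain size of each subtask
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker, chunks))        # subtasks run concurrently
    return sum(partials)                                 # master combines partial results

print(master(list(range(1000))))                         # same answer as the serial sum
```

Making the chunks larger (fewer, bigger subtasks) lowers the communication-to-computation ratio, which is exactly the property that makes coarse-grained algorithms attractive on networks of PCs.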
The granularity of a parallel computer can be defined as the ratio of the time required for a basic communication operation to the time required for a basic computation. Parallel computers for which this ratio is small are suitable for algorithms requiring frequent communication, that is, algorithms in which the grain size of the computation (before a communication is required) is small. Since such algorithms contain fine-grained parallelism, these parallel computers are often called fine-grained computers. On the contrary, parallel computers for which this ratio is large are suited to algorithms that do not require frequent communication. These computers are referred to as coarse-grained computers.

2.2.3 Performance measures for parallel systems

In measuring the performance of a given algorithm, a sequential algorithm is usually evaluated in terms of its execution time, expressed as a function of the size of its input [59]. However, the execution time of a parallel algorithm depends not only on the input size but also on the architecture of the parallel computer and the number of processors. Therefore, a parallel algorithm cannot be evaluated in isolation from a parallel architecture. A parallel system is defined as the combination of an algorithm and the parallel architecture on which it is implemented. According to the theory of parallel computing in reference [59], there are many measures that are commonly used for evaluating the performance of parallel systems. These will be presented as follows:

Run time

The serial run time of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel run time is the time that elapses from the moment that a parallel computation starts to the moment that the last processor finishes execution.
The serial run time and the parallel run time are denoted by:

    Serial run time = T_S    (2.25a)
    Parallel run time = T_P    (2.25b)

Speedup

When evaluating a parallel system, we are often interested in knowing how much performance gain is achieved by parallelizing a given application over a sequential implementation. Speedup, denoted by S, is a measure that captures the relative benefit of solving a problem in parallel. It is formally defined as the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors. The p processors used by the parallel algorithm are assumed to be identical to the one used by the sequential algorithm. Mathematically, the speedup can be expressed as:

    S = \frac{T_S}{T_P}    (2.26)
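These performance measures reduce to simple arithmetic. The sketch below computes speedup per equation (2.26), together with the efficiency and cost measures defined next in this section; the timing figures are illustrative assumptions, not measurements from this thesis.

```python
def speedup(t_serial, t_parallel):
    # Equation (2.26): S = T_S / T_P.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    # E = S / p, the fraction of time each processor is usefully employed.
    return speedup(t_serial, t_parallel) / p

def cost(t_parallel, p):
    # Cost (processor-time product) of the parallel solution: p * T_P.
    return p * t_parallel

# Illustrative numbers: a 100 s serial run solved in 30 s on 4 processors
# gives S = 3.33, E = 0.83, and a cost of 120 processor-seconds.
print(speedup(100.0, 30.0), efficiency(100.0, 30.0, 4), cost(30.0, 4))
```

Note that the cost (120 processor-seconds here) exceeds the 100 s serial run time, so this hypothetical system is not cost-optimal: the gap corresponds to the efficiency being below one.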
Efficiency

Only an ideal parallel system containing p processors can deliver a speedup equal to p. In practice, ideal behavior is not achieved because, while executing a parallel algorithm, the processors cannot devote 100 percent of their time to the computations of the algorithm. Efficiency, denoted by E, is a measure of the fraction of time for which a processor is usefully employed. It is defined as the ratio of the speedup, S, to the number of processors, p. In an ideal parallel system, speedup is equal to p and efficiency is equal to one. In practice, speedup is less than p and efficiency is between zero and one, depending on the degree of effectiveness with which the processors are utilized. Mathematically, efficiency is given by:

    E = \frac{S}{p}    (2.27)

Cost

The cost of solving a problem on a parallel system is defined as the product of parallel run time and the number of processors used. Cost is sometimes referred to as work or processor-time product. It reflects the sum of the time that each processor spends solving the problem. The cost of solving a problem on a single processor is the execution time of the fastest known sequential algorithm. A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is comparable to the execution time of the fastest-known sequential algorithm on a single processor.

2.3 Applications of Parallel Processing in Computational Mechanics

As in other fields of scientific and engineering applications, parallel processing has been a promising tool for solving complex FEM problems [1,12,17,22,65,66]. Chiang and Fulton [22] mentioned two methods for parallelizing FEM based on the generation of the element stiffness matrix and force vector, namely, element-by-element parallelism and subdomain parallelism.
In element-by-element parallelism, each processor calculates the matrix and vector of its own elements, one at a time, whereas in the subdomain parallelism method each processor is responsible for a certain subdomain and calculates all the matrices and vectors of its elements. Subdomain parallelism is similar in basic idea to the domain decomposition method, which has so far been the predominant method in parallel FEM applications. Examples of the domain decomposition method can be found in references [12], [17] and [66].

Escaig and Marin stated in reference [66] that the domain decomposition method consists of partitioning the initial problem domain into subdomains, solving the initial problem on each subdomain, solving a problem at the interfaces of the subdomains, and back-substituting this solution to the respective subdomains. They noted that, in addition to the benefits of parallel computing, the domain decomposition method offers the possibility of recalculating the solution of a non-linear problem at each step only in the affected subdomains, resulting in even faster solutions. This property may result in a large reduction of the total execution time for problems over which the non-linearity is irregularly distributed. Yagawa et al. [12] pointed out that the domain decomposition method is a coarse-grained algorithm. From their study, they claimed that higher performance was achieved when the size of the subdomains was increased, resulting in a larger granularity of the parallel computation.

Besides FEM, parallel processing is also being utilized in the study of meshless methods. According to Günter et al. [11], the domain decomposition concept was used in the RKPM parallel implementation on a distributed-memory parallel computer. The domains of influence were analyzed by a parallel analysis technique in the preprocessing step. Each quadrature point was given a tag identifying the processor owning the given point. The quadrature point information was then distributed to and evaluated by its respective processor in parallel. A special technique to enforce the essential boundary conditions was developed. With this technique, the need for additional communication within the solver in order to satisfy the boundary conditions was eliminated. The discrete equations were solved in parallel by routines in the Portable, Extensible Toolkit for Scientific Computation (PETSc) [45].

2.4 The NASA Beowulf Parallel Computer

The recent rapid increase in performance of mass-market commodity PC microprocessors, and the significant difference in pricing between PCs and relatively expensive scientific workstations, have provided an opportunity for substantial gains in performance-to-cost ratio. This leads to the idea of harnessing PC technology in parallel ensembles to provide high-end capability for scientific and engineering applications [8]. It was cited in reference [42] that the advancement in microprocessor technology enables Intel-based PCs to deliver performance comparable to that of supercomputers. In addition, the availability of low-cost Local Area Network (LAN) connections makes it cheap and easy to combine these powerful PCs or workstations4 to build a high-performance parallel computing environment.
The effort to deliver low-cost high-performance computing platforms to scientific communities has been going on for many years. It was mentioned in reference [42] that a network of PCs is a good candidate since it has the same architecture as the distributed-memory multicomputer system5. Many research groups have assembled commodity off-the-shelf (COTS) PCs and fast LAN connections to build parallel computers. Parallel computers of this type are suitable for coarse-grained applications that are not communication intensive, because of the high communication start-up time and the limited bandwidth associated with the underlying network architectures [27].

The parallel computers built from networks of PCs, called clusters, can be classified by workstation ownership into two types, namely, dedicated clusters and non-dedicated clusters [27]. In the case of dedicated clusters, particular individuals do not own the workstations, and the resources are shared so that parallel computing can be performed across the entire cluster. On the contrary, in the case of non-dedicated clusters, the individuals own their workstations and the parallel applications are executed by utilizing idle CPU cycles. The advantage of the former type over the latter is the fast interactive response from the dedicated nodes.

4 The combinations of input, output, and computing hardware that can be used for work by individuals [33].
5 The message-passing MIMD parallel computer. See Section 2.2.2 for more detail.
The NASA Beowulf project was one of the largest initiatives for the dedicated cluster-based parallel computer. According to reference [8], the Beowulf project was a NASA initiative sponsored by the High Performance Computing and Communications (HPCC) program to explore the potential of the Pile-of-PCs6 and to develop the necessary methodologies to apply these low-cost system configurations to NASA computational requirements in the Earth and space sciences. The project emphasized three governing principles [27]:

No custom hardware components

Beowulf exploited the use of commodity components and computer industry standards that had been developed under competitive market conditions and were in mass production. No individual vendor owned the rights to the product; therefore the system could be comprised of hardware components from many sources.

Incremental growth and technology tracking

As new PC technologies became available, the Beowulf system administrators had total control over the configuration of the cluster. They could choose to selectively upgrade some components of the system with new ones that were best suited to their application needs, rather than being restricted to vendor-based configurations.

Use of readily available and free software components

Beowulf used public domain operating systems and software libraries, which were supplied with source code. This type of software had been widely accepted and developed in the academic community; therefore the administrators could be confident that their system would deliver high software performance at the lowest cost.

The operating point targeted by the Beowulf project was intended for scientific applications and users requiring repeated use of large data sets and large applications with easily delineated coarse-grained parallelism [56].
It was reported in reference [8] that a 16-processor Beowulf costing less than $50,000 sustained 1.25 Gigaflops7 on a scientific space simulation, which is comparable to much more expensive supercomputers. Because of this low-cost, high-performance characteristic, many Beowulf-type parallel computers have now been built across the world [77]. In Thailand, a Beowulf-type parallel computer named SMILE was built by Uthayopas et al. [42] at the Faculty of Engineering, Kasetsart University. Beowulf-type parallel computers provide universities, often with limited resources, an excellent platform on which to teach parallel programming courses, and provide cost-effective computing to their computational scientists as well.

6 The term used to describe a loose ensemble of PCs applied in concert to a single problem [8].
7 10^9 floating-point operations per second. According to reference [33], FLOPS, a measure of a computer's power, is the number of arithmetic operations performed on data stored in floating-point notation in one second.
CHAPTER 3  BUILDING THE PARALLEL COMPUTING INFRASTRUCTURES

It can be concluded from Section 2.4 that the Beowulf-type parallel computer is a very good choice for parallel processing in an academic environment because of its low cost and high performance. Therefore, the parallel implementation of the EFGM is done on this platform. The procedure used to build the AIT Beowulf, a four-node Beowulf-like parallel computer, is described in this section. A node, in the context of the Beowulf, is one of several computers that are connected via a local area network (LAN) to form a parallel computer.

3.1 Hardware and Operating System Installation

Based on the guidelines in references [42] and [75], the AIT Beowulf (see Figure 3-1) is currently comprised of one server⁸ node and three workstation⁹ nodes, the configurations of which are presented in Table 3-1 and Table 3-2, respectively. A dual-CPU arrangement was chosen for the server node so that symmetric multiprocessing (SMP)¹⁰ could be explored in future work. These nodes are attached to the hub¹¹ described in Table 3-3 to form a local area network.

After the hardware components were connected, Red Hat Linux 6.0, a distribution of the Linux operating system, was installed on each node. Red Hat Linux comes with many choices of operating system components, called packages, to match users' needs. For the AIT Beowulf, the server operating-system packages were installed on the server node in addition to the workstation packages that were common to all nodes.

Linux, the necessary operating system for the Beowulf-type parallel computer, is a public domain POSIX-compliant UNIX-like operating system that runs on personal computers [27]. Linux is necessary, according to the Beowulf principles [27], because it is readily available and distributed free of charge.
POSIX, the acronym for Portable Operating System Interface for UNIX, is an IEEE (Institute of Electrical and Electronics Engineers) standard that defines a set of operating-system services. Programs that adhere to this standard can be easily ported from one system to another [33]. Since Linux provides a POSIX-compatible UNIX environment, serial and parallel applications written for computers running UNIX, for example, scientific workstations and supercomputers, can be compiled and run seamlessly on the Beowulf. Red Hat Linux was chosen for the AIT Beowulf because of its powerful network management software and ease of installation.

⁸ A computer running administrative software that controls access to the network and its resources, such as disk drives, and provides resources to computers functioning as workstations on the local area network [33].
⁹ See the previous definition on page 19.
¹⁰ A computer architecture in which multiple processors share the same memory, which contains one copy of the operating system, one copy of any applications that are in use, and one copy of the data [33].
¹¹ A device that joins communication lines at a central location in the network and provides a common connection to all devices on the network [33].
Like an ordinary UNIX networked computer, each node of a Beowulf requires user accounts and consistent network properties. User accounts are created by conventional UNIX system administrative commands that can be found in reference [39]. The consistent network properties for the nodes in the AIT Beowulf are defined based on the RFC 1918 private Internet Protocol address (IP address) guidelines. A 'Request for Comments' (RFC) is a document in which a standard, a protocol, or other information pertaining to the operation of the Internet is published [33]. RFC 1918 can also be obtained from the Internet at URL:http://www.alternic.net/rfcs/1900/rfc1918.txt.html. The properties that are common to all nodes are presented in Table 3-4; the properties that are specific to each node are presented in Table 3-5. Since these properties are defined based on the Internet standard, it will be possible to add Internet access capability to the AIT Beowulf in the future. For more information about how to assign network properties to the nodes, the readers are referred to references [28] and [39]. Reference [28] contains introductory materials for Linux system administration; detailed resources on UNIX system administration can be found in reference [39]. An example of the network configuration files for the AIT Beowulf can be found in Appendix A1 on page 84.

Figure 3-1 The AIT Beowulf Hardware Configuration: a 100-Mbps Fast Ethernet hub connecting Workstation Node 1 (nod1.cml.ait.ac.th), Workstation Node 2 (nod2.cml.ait.ac.th), Workstation Node 3 (nod3.cml.ait.ac.th), and the Server Node (svr1.cml.ait.ac.th)
Table 3-1 The Server Hardware Configuration¹²

Item               Description
CPU                Dual Intel Pentium III-450 MHz
Motherboard        Dual CPU server motherboard
Main Memory        128-MB SDRAM
Hard Drive         16-GB Ultra DMA/66 ATA hard drive
CD-ROM Drive       Generic IDE CD-ROM drive
Floppy Disk Drive  Generic 1.44-MB floppy disk drive
Network Card       100-Mbps Fast Ethernet card
Display Adaptor    OpenGL-capable graphics adaptor
Monitor            17-inch monitor
Keyboard           Generic PS/2 keyboard
Mouse              Generic PS/2 mouse

¹² These tables contain many technical terms and abbreviations that are very common in the computer industry. The readers are referred to standard computer hardware textbooks, such as reference [31], for detailed definitions.
Table 3-2 The Workstation Hardware Configuration¹³

Item               Description
CPU                Intel Pentium III-450 MHz
Motherboard        Generic motherboard
Main Memory        64-MB SDRAM
Hard Drive         8-GB Ultra DMA/66 ATA hard drive
CD-ROM Drive       Generic IDE CD-ROM drive
Floppy Disk Drive  Generic 1.44-MB floppy disk drive
Network Card       100-Mbps Fast Ethernet card
Display Adaptor    Generic display adaptor
Monitor            Not required
Keyboard           Not required
Mouse              Not required

Table 3-3 Networking Equipment¹⁴

Item          Description
Ethernet Hub  8-port 100-Mbps stackable Fast Ethernet hub
LAN Cable     UTP CAT-5 cables with RJ-45 connectors

¹³ See footnote 12 on page 23.
¹⁴ See footnote 12 on page 23.
Table 3-4 Common Network Properties

Item         Assigned Value
Network
Gateway
Broadcast
Netmask
Domain Name  cml.ait.ac.th

Table 3-5 Nodal Specific Network Properties

Computer        Node Name  Full Name           IP Address
Server          svr1       svr1.cml.ait.ac.th
Workstation #1  nod1       nod1.cml.ait.ac.th
Workstation #2  nod2       nod2.cml.ait.ac.th
Workstation #3  nod3       nod3.cml.ait.ac.th

3.2 Software Configuration

In addition to the installation of hardware components and operating systems, the configuration or installation of the Beowulf fundamental software, which can be divided into software libraries and system services, is required. The software libraries that must be installed are the message-passing library and the application-specific libraries. For this thesis, the only application-specific library is Meschach, a matrix computation library. The system services that must be configured are the Remote Shell and the Network File System.

3.2.1 Software libraries

Message-passing library

The Beowulf is a message-passing MIMD parallel computer (see Section 2.4); therefore, a message-passing infrastructure is needed. The mpich library [62], which is the most widely used [26] free implementation of the Message Passing Interface (MPI) [32], was chosen for the AIT Beowulf. MPI is a message-passing standard defined by the MPI Forum, a committee composed of vendors and users formed at the Supercomputing Conference in
1992. Since the goals of the MPI design were portability, efficiency, and functionality [27], parallel software written for the AIT Beowulf can be easily ported to more sophisticated parallel computers. The installation of the mpich library is straightforward and will not be discussed here; the readers are referred to reference [63] for installation procedures. The library is available from the Internet at URL:http://www.mcs.anl.gov/mpi/mpich.

Matrix computation library

The EFGM relies heavily on matrix computations. Matrix multiplications and inversions are needed each time the MLS shape functions or shape function derivatives are evaluated. Therefore, an efficient and reliable matrix computation library is essential. Meschach, a powerful matrix computation library by Stewart and Leyk [6] at the Australian National University, was chosen. The Meschach library provides user-friendly routines, with sophisticated algorithms, to address all basic operations dealing with matrices and vectors. The readers are referred to reference [6] for detailed installation procedures.

3.2.2 System services

Remote shell

The mpich library mentioned above can be used on a wide variety of parallel-computing platforms. For the Beowulf-type parallel computers, the UNIX remote shell utility is required to run message-passing parallel software [62]. According to reference [39], the remote shell utility, or the rsh command, is a UNIX remote utility that allows a user to execute a program on a remote system¹⁵ without passing through the password-authenticated login process. Users can simply specify the hostname or IP address of the remote host to execute a command on that machine.
For remote utilities, which include the Remote Shell, to function, the following are required:

a) A remote shell server program on the remote host
b) An entry in the /etc/hosts file on the remote host
c) An entry in either the .rhosts or the /etc/hosts.equiv file on the remote host

The remote shell server program is automatically started in typical Red Hat Linux installations and need not be configured again. The use of a /etc/hosts.equiv file is not advisable for UNIX security reasons [39]. Therefore, the .rhosts file was used in the AIT Beowulf. For details on setting up the /etc/hosts and the .rhosts files, the readers are referred to Appendix A1.3 and A1.4, respectively.

¹⁵ A remote computer is a computer that is accessed through some type of communication lines, rather than directly accessed through the keyboard-and-monitor terminal [33].
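A minimal sketch of the two files involved is given below. The IP addresses shown are illustrative RFC 1918 values only; the actual addresses and full configuration files used on the AIT Beowulf are those in Table 3-5 and Appendix A1.

```
# /etc/hosts on every node: maps each node's hostname to its IP address
# (addresses below are hypothetical examples, not the real assignments)
192.168.1.1   svr1.cml.ait.ac.th   svr1
192.168.1.2   nod1.cml.ait.ac.th   nod1
192.168.1.3   nod2.cml.ait.ac.th   nod2
192.168.1.4   nod3.cml.ait.ac.th   nod3

# ~/.rhosts in the user's home directory: one line per trusted host
# allowed to run commands under this account without a password
svr1.cml.ait.ac.th
nod1.cml.ait.ac.th
nod2.cml.ait.ac.th
nod3.cml.ait.ac.th
```

With entries like these in place, a command such as `rsh nod1 uname -a` executes on nod1 without a password prompt, which is the behaviour the message-passing runtime relies on to start remote processes.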
Network File System

The Network File System (NFS) is a UNIX system service developed by Sun Microsystems to allow users to access a remote file system¹⁶ while making it appear as if it were local¹⁷. In order to do this, the server exports the file system to the workstations through settings in the /etc/exports file. The workstations mount¹⁸ those exported directories or file systems into their local file systems through settings in the /etc/fstab file. Details on configuring the /etc/exports and the /etc/fstab files are presented in Appendix A2.

The use of NFS is necessary for the Beowulf because the parallel application files have to be locally accessible to every node; as the number of nodes increases, it would be practically impossible to make copies of these files on every node in the cluster. The AIT Beowulf NFS configuration is shown in Figure 3-2. The /usr/local and the /home/shared directories on the server node are exported to all workstation nodes. The former stores the mpich and Meschach software libraries, while the latter stores the application files. Software development is done on the server node and the resulting executable files are stored in the /home/shared exported directory. Once the software is run, all nodes in the cluster perform input and output operations on the same exported file systems on the server node.

Figure 3-2 The AIT Beowulf NFS Configuration: the server node (svr1.cml.ait.ac.th) exports /usr/local and /home/shared, which are mounted by Workstation Nodes 1-3 (nod1.cml.ait.ac.th to nod3.cml.ait.ac.th)

¹⁶ In an operating system, a file system is the overall structure in which files are named, stored, and organized. A file system consists of files, directories, and the information needed to locate and access these items [33].
¹⁷ The opposite of remote (see footnote 15).
A local device is one that can be accessed directly rather than by means of communication lines [33].
¹⁸ To make a data storage medium accessible to a computer's local file system [33].
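The export-and-mount arrangement described above can be sketched as follows. The host names and directories are those given in the text; the option fields are hypothetical illustrations of standard exports(5) and fstab(5) syntax, and the authoritative versions of both files are presented in Appendix A2.

```
# /etc/exports on the server node (svr1): export the shared directories
# to the three workstation nodes with read-write access
/usr/local    nod1.cml.ait.ac.th(rw) nod2.cml.ait.ac.th(rw) nod3.cml.ait.ac.th(rw)
/home/shared  nod1.cml.ait.ac.th(rw) nod2.cml.ait.ac.th(rw) nod3.cml.ait.ac.th(rw)

# /etc/fstab entries on each workstation node: mount the exports at the
# same paths used on the server, so file names are identical cluster-wide
svr1.cml.ait.ac.th:/usr/local    /usr/local    nfs  defaults  0 0
svr1.cml.ait.ac.th:/home/shared  /home/shared  nfs  defaults  0 0
```

Mounting the exports at identical paths on every node is what allows a parallel executable stored once in /home/shared to be launched, and to perform its input and output, with the same path on all four machines.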
CHAPTER 4  DEVELOPMENT OF THE PARALLEL EFGM SOFTWARE

After the parallel computing infrastructures were set up, ParEFG, the parallel element-free Galerkin analysis computer code, was designed. ParEFG is the parallelized version of the Plastic Element-Free Galerkin (PLEFG) software, which was developed by Barry and Saigal [61]. According to reference [60], PLEFG has the capability to analyze three-dimensional, small-strain, elastic and elastoplastic problems with nonlinear isotropic and kinematic strain hardening. However, the nonlinear features are beyond the scope of this thesis (see Section 1.4) and are not available in the current version of ParEFG.

4.1 Design Considerations

As discussed in Section 2.4, the architecture of the Beowulf-type parallel computer resembles that of the multicomputer, the message-passing MIMD (multiple instruction, multiple data) parallel system, and therefore the multicomputer parallel machine model is employed. In the multicomputer model, the parallel computer is comprised of a number of von Neumann computers¹⁹, or nodes, linked by an interconnection network. Each computer executes its own program. This program may access local memory and may send and receive messages over the network. Messages are used to communicate with other computers or, equivalently, to read from and write to remote memories. The multicomputer parallel machine model is illustrated in Figure 4-1.

Figure 4-1 The Multicomputer Parallel Machine Model: CPU-memory nodes joined by an interconnection network. Source: Foster [16]

From reference [16], four properties are desirable for high-performance parallel software: concurrency, the ability to perform many actions at the same time; scalability, the

¹⁹ A von Neumann computer is a robust sequential machine model used to study algorithms and programming languages in computer science [16].