A Parallel Implementation of the Element-Free Galerkin Method



Authors: Barry, William and Vacharasintopchai, Thiti

Issue Date: 5-Dec-2001

Type: Article

Series/Report no.: Proc. 8th East Asia-Pacific Conference on Structural Engineering and Construction (EASEC-8), Singapore, December 5-7, 2001

Abstract: This work focuses on the application of parallel processing to element-free Galerkin method analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations, and the solution for nodal unknowns. The objective is to significantly reduce the analysis time while retaining high efficiency and accuracy. Several relatively low-cost Intel Pentium-based personal computers are joined together to form a parallel computer. The processors communicate via a local high-speed network using the Message Passing Interface. Load balancing is achieved through the use of a dynamic queue server that assigns tasks to available processors. Benchmark problems in 3D structural mechanics are analyzed to demonstrate that the parallelized computer program can provide substantially shorter run time than its serial counterpart, without loss of solution accuracy.

URI: http://dspace.siu.ac.th/handle/1532/132



Paper No. 1068, Proc. 8th East Asia-Pacific Conference on Structural Engineering and Construction (EASEC-8), Singapore, December 5-7, 2001.

A PARALLEL IMPLEMENTATION OF THE ELEMENT-FREE GALERKIN METHOD

W. Barry¹ and T. Vacharasintopchai²

KEYWORDS: meshless method, parallel processing, element-free Galerkin method, EFGM, queue server, Beowulf, solid mechanics

1. INTRODUCTION

In performing the finite element analysis of structural components, meshing, which is the process of discretizing the problem domain into small sub-regions or elements with specific nodal connectivities, can be a tedious and time-consuming task. Although some relatively simple geometric configurations may be meshed automatically, complex geometric configurations require manual preparation of the mesh. The element-free Galerkin method (EFGM), one of the recently developed meshless methods, avoids the need for meshing by employing a moving least-squares (MLS) approximation for the field quantities of interest.
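The MLS approximation can be summarized as follows. This is the standard construction from the meshless literature (an editorial sketch in the notation of [1], not reproduced from this paper): the approximant is built from a polynomial basis p(x) with spatially varying coefficients a(x), obtained by a weighted least-squares fit to the nodal values u_I.

```latex
u^h(\mathbf{x}) = \mathbf{p}^{\mathrm T}(\mathbf{x})\,\mathbf{a}(\mathbf{x}),
\qquad
\min_{\mathbf{a}}\; J(\mathbf{x}) =
  \sum_{I=1}^{n} w(\mathbf{x}-\mathbf{x}_I)
  \bigl[\mathbf{p}^{\mathrm T}(\mathbf{x}_I)\,\mathbf{a}(\mathbf{x}) - u_I\bigr]^2
\;\Longrightarrow\;
\mathbf{A}(\mathbf{x})\,\mathbf{a}(\mathbf{x}) = \mathbf{B}(\mathbf{x})\,\mathbf{u},
```

```latex
\mathbf{A}(\mathbf{x}) = \sum_{I=1}^{n} w(\mathbf{x}-\mathbf{x}_I)\,
  \mathbf{p}(\mathbf{x}_I)\,\mathbf{p}^{\mathrm T}(\mathbf{x}_I),
\qquad
\mathbf{B}_I(\mathbf{x}) = w(\mathbf{x}-\mathbf{x}_I)\,\mathbf{p}(\mathbf{x}_I),
\qquad
\phi_I(\mathbf{x}) = \mathbf{p}^{\mathrm T}(\mathbf{x})\,
  \mathbf{A}^{-1}(\mathbf{x})\,\mathbf{B}_I(\mathbf{x}),
```

so that u^h(x) = Σ_I φ_I(x) u_I. Because A(x) must be formed and solved at every integration point, shape-function evaluation dominates the cost of a 3D EFGM analysis, which is the cost the parallelization targets.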
With EFGM, the discrete model of the problem domain is completely described by nodes and a description of the problem domain boundary. This is a particular advantage for problems involving propagating cracks or large deformations, since no remeshing is required at each step of the analysis. Detailed formulations of the MLS approximation functions and the application of EFGM to problems in solid mechanics may be found in [1].

However, the advantage of avoiding the requirement of a mesh does not come cheaply, as EFGM is much more computationally expensive than the finite element method (FEM). The increased computational cost is especially evident for three-dimensional and non-linear applications of the EFGM, due to the usage of MLS shape functions, which are formulated by a least-squares procedure at each integration point. This computational cost is the predominant drawback of EFGM.

Parallel processing has long been an available technique for improving the performance of scientific computing programs. Typically, a parallel computer program employs the 'divide and conquer' paradigm [2], which involves the partitioning of a large task into several smaller tasks that are then assigned to available computer processors. Efficient load balancing ensures that all processors are busy working on assigned tasks as long as unfinished tasks remain. The most common approach taken in computational mechanics is domain decomposition [3], a method of static load balancing in which the tasks are identified prior to the analysis and assigned to each processor, along with any data that may be required. Due to the complex nodal connectivities that arise in the EFGM, domain decomposition may not be the most efficient approach, and thus a dynamic class of load balancing based on the concept of a queue server is employed in this work.

¹ Asian Institute of Technology, Thailand, Assistant Professor
² Asian Institute of Technology, Thailand, Graduate Student

2. THE AIT BEOWULF

The effort to deliver low-cost, high-performance computing platforms to scientific communities has been ongoing for many years. A network of personal computers is attractive for this type of use since it has the same architecture as a distributed-memory multicomputer system [4]. Many research groups have assembled commodity off-the-shelf PCs and fast LAN connections to build parallel computers. Parallel computers of this type, termed Beowulf computers after the NASA project of the same name [5], are suitable for coarse-grained applications that are not communication intensive, because of the high communication start-up time and the limited bandwidth associated with the underlying network architectures [6].

The AIT Beowulf, a four-node Beowulf-class parallel computer, was assembled based on the guidelines in [5] and [7]. Red Hat Linux 6.0, including both the server and workstation operating system packages, was installed on each node. The AIT Beowulf is a message-passing multiple-instruction, multiple-data (MIMD) architecture, and thus a message-passing infrastructure is needed. The mpich library [8], the most widely used free implementation of the Message Passing Interface, was chosen for the AIT Beowulf.
Meschach, a powerful matrix computation library [9], is employed for the serial matrix operations performed on each processor.

3. THE QUEUE SERVER

Load balancing has a crucial role in the performance of parallel software. If unbalanced workloads are assigned to the processors, some may finish their work and be forced to wait for the other processors to finish, leading to reduced efficiency and increased run times. In this work, a dynamic load-balancing agent named Qserv is developed within the framework of the EFGM. Qserv balances the computational load among the processors in the AIT Beowulf at run time by acting as a clerk that directs the queued tasks to the available processors. When a processor finishes a task, it requests another task from Qserv, which continues assigning tasks to processors until no unfinished tasks remain.

Figure 1 presents a flowchart of the queue server designed and implemented in the current work. To separate the dynamic workload allocation from normal operations, the communication between Qserv and the processors is done through the UNIX socket concept developed at the University of California at Berkeley [4]. When the Qserv process is initiated, it creates a socket that allows the processors to connect simultaneously. Initially, the number of total unprocessed subtasks known to Qserv is zero, and one processor, usually the master processor, must inform Qserv of the actual value. This number is stored in the max_num variable and can be altered by processors through the SET_MAX_NUM request. A processor can ask Qserv, through the GET_NUM request, for a subtask to work on; it is assigned the numerical identifier of an unprocessed subtask, ranging from zero to max_num. When the unprocessed subtasks are exhausted, an ALL_DONE signal is sent to the requesting processor. During the execution of Qserv, a process can also reset the subtask identifier counter with the RESET_COUNTER request.
Qserv will continue serving tasks to processors until the TERMINATE signal is received.
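The counter logic behind this protocol can be sketched as a small state machine. The following is a minimal, illustrative C sketch only: the request names follow the paper, but the types, values, and function names are hypothetical, and the real Qserv serves these requests over a Berkeley socket rather than a direct function call.

```c
#include <assert.h>

/* Request codes mirroring the Qserv protocol (names from the paper;
 * numeric values and this enum are editorial assumptions). */
enum request { SET_MAX_NUM, GET_NUM, RESET_COUNTER, TERMINATE };

#define ALL_DONE (-1)   /* reply sent when no unprocessed subtasks remain */

struct qserv {
    int count;          /* identifier of the next unprocessed subtask    */
    int max_num;        /* largest valid subtask identifier              */
};

/* Handle one client request.  For GET_NUM the return value is the
 * subtask identifier to work on (0 .. max_num) or ALL_DONE; the other
 * requests return 0.  TERMINATE is acted on by the caller's accept loop. */
int qserv_handle(struct qserv *q, enum request req, int arg)
{
    switch (req) {
    case SET_MAX_NUM:
        q->max_num = arg;   /* master announces the total subtask count  */
        return 0;
    case RESET_COUNTER:
        q->count = 0;       /* begin a new phase of subtasks             */
        return 0;
    case GET_NUM:
        /* Flowchart test "count <= max_num": hand out the next subtask
         * id, otherwise tell the requesting worker everything is taken. */
        return (q->count <= q->max_num) ? q->count++ : ALL_DONE;
    default:
        return 0;
    }
}
```

A worker process would then loop, issuing GET_NUM, processing the returned subtask, and stopping on ALL_DONE; this is the dynamic load balancing applied in the stiffness, force-vector, and post-processing phases.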
Figure 1: Flowchart of the Queue Server (socket accept loop dispatching the TERMINATE, RESET_COUNTER, SET_MAX_NUM, and GET_NUM requests)

4. SOFTWARE IMPLEMENTATION

When a parallel program is run, each parallel processor holds one copy of the executable program, termed a process. One process is assigned as the master process, while the remaining processes are worker processes; the MPI default process identifier of the master is 0. In addition to performing the basic tasks of a worker process, the master process performs the additional work of coordinating the tasks among all the workers. Therefore the master process is assigned to run on the
server node, which is the most powerful processor in the AIT Beowulf, in terms of both processor speed and core memory.

A flowchart of the main process computer code, for both the master node and the worker nodes, is presented in Figure 2. The analysis procedures can be grouped into five phases, namely, the pre-processing phase, the stiffness matrix formulation phase, the force vector formulation phase, the solution phase, and the post-processing phase. A custom parallel Gaussian elimination equation solver, developed from the algorithm presented in [10], is employed in the solution phase, since the available public-domain parallel equation solvers are typically efficient only for banded, sparse matrices and do not suit the dense EFGM global stiffness matrix.

Figure 2: Flowcharts of the Master and Worker Modules (pre-processing via dd_input with a broadcast of the processed input data; stiffness formulation via ddefg_stiff; force vector formulation via ddforce; solution via master_ddsolve/worker_ddsolve; post-processing via ddpost, with results gathered to the master)

5. NUMERICAL RESULTS

Several 3D elastostatic examples are solved to illustrate the performance and to verify the validity of the parallel EFGM analysis code. The results obtained for each analysis closely matched the analytical solutions [11], as shown in previous serial EFGM works [1].
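The timing results that follow are reported as speedup curves; the paper does not state the definitions explicitly, so as an editorial aid, the usual definitions are recalled here:

```latex
S_p = \frac{T_1}{T_p},
\qquad
E_p = \frac{S_p}{p},
```

where $T_1$ is the run time of the serial code and $T_p$ the run time on $p$ processors. Under these definitions the theoretical limit for $p$ processors is $S_p = p$, i.e. a parallel efficiency of $E_p = 1$.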
Thus, the main focus of these numerical examples is to investigate the run time and efficiency of the parallel implementation of the EFGM. Four test cases, with increasing numbers of degrees of freedom, are analyzed using parallel
processor counts ranging from one to four. The specific test cases are: 1) linear displacement patch test (336 d.o.f.); 2) cantilever beam with end loading (825 d.o.f.); 3) pure bending of a thick arch (975 d.o.f.); and 4) perforated tension strip (2850 d.o.f.). The speedups of the overall solution process, the computation and assembly of the global stiffness matrix, and the solution of the discrete system of equations are shown in Figures 3 to 5, respectively.

Figure 4 shows that, when the number of degrees of freedom is less than 1,000, the speedup of the stiffness matrix formulation phase gradually approaches the theoretical limit, which is equal to the number of processors used in the analysis. However, the speedup begins to decrease when the number of degrees of freedom exceeds 1,000, apparently due to the initiation of memory page-file swapping on each processor. This may occur since the current implementation requires the full storage of the global stiffness matrix on each processor. Figure 5 shows that the optimal points, in terms of speedup, for the parallel Gaussian elimination solver are near 350, 550, and 600 equations for two, three, and four processors, respectively. When the number of equations is greater than 1,000, the speedup of the solver begins to decrease, possibly for the same reason as in the stiffness matrix formulation phase, that is, the onset of memory page-file swapping. Hence, it can be concluded that the current implementation is scalable up to 1,000 degrees of freedom.

Figure 3: Overall Speedup of the EFGM Analysis Code
Figure 4: Speedup of the Stiffness Computation Module
Figure 5: Speedup of the Gaussian Elimination Solver

6. CONCLUSION

AIT Beowulf, a high-performance yet low-cost parallel computer assembled from a network of commodity personal computers, was established. A parallel implementation of the element-free Galerkin method was developed on this platform. Four desired properties of parallel software, namely concurrency, scalability, locality, and modularity, were taken into account during the design of the
parallel version of the element-free Galerkin method. A dynamic load-balancing algorithm was utilized for the computation of the structural stiffness matrix and the external force vector, and a parallel Gaussian elimination algorithm was employed in the solution for the nodal unknowns (displacements). Several numerical examples showed that the displacements and stresses obtained from the parallel implementation closely matched the analytical solutions and exactly matched the solutions obtained by the sequential element-free Galerkin method software. With Qserv, a dynamic load-balancing agent, high scalability was obtained for the three-dimensional structural mechanics problems up to approximately 1,000 degrees of freedom. However, scalability was not achieved for larger problems, due to the requirement of full stiffness matrix storage on each processor, while only 64 megabytes of memory was available on each worker node. The parallel Gaussian elimination equation solver took less time to solve the system of equations than its sequential counterpart. With larger systems of equations, the efficiency of the parallel equation solver tended to increase because of the increased computation-to-communication ratio. Nevertheless, in the current implementation of the parallel EFGM analysis code, high efficiency was not obtained when the number of equations was more than 1,000. Refinement of the memory management algorithms is recommended so that the parallel EFGM analysis code may be scalable to problem sizes much larger than 1,000 degrees of freedom.

7. REFERENCES

[1] T. Belytschko, Y. Krongauz, D. Organ, M. Fleming, and P. Krysl, "Meshless methods: An overview and recent developments", Computer Methods in Applied Mechanics and Engineering, Vol. 139, No. 1-4, pp. 3-47, 1996.
[2] H. Adeli and O. Kamal, Parallel Processing in Structural Engineering, Elsevier Science Publishers Ltd., U.K., 1993.
[3] K. T. Danielson, S. Hao, W. K. Liu, A. Uras, and S. Li, "Parallel computation of meshless methods for explicit dynamic analysis", accepted for publication in International Journal for Numerical Methods in Engineering, 1999.
[4] C. Brown, UNIX Distributed Programming, Prentice Hall International (UK) Limited, U.K., 1994.
[5] P. Merkey, "Beowulf: Introduction & overview", Center of Excellence in Space Data and Information Sciences, University Space Research Association, Goddard Space Flight Center, Maryland, USA, September 1998, URL: http://www.beowulf.org/intro.html.
[6] M. Baker and R. Buyya, "Cluster computing: The commodity supercomputer", Software—Practice and Experience, Vol. 29, No. 6, pp. 551-576, 1999.
[7] J. Radajewski and D. Eadline, "Beowulf HOWTO", November 1998, URL: http://www.linux.org/help/ldp/howto/Beowulf-HOWTO.html.
[8] W. Gropp and E. Lusk, Users Guide for mpich, a Portable Implementation of MPI, Technical Report ANL-96/6, Argonne National Laboratory, USA, 1996.
[9] D. Stewart and Z. Leyk, Meschach: Matrix Computations in C, Proceedings of the Centre for Mathematics and Its Applications, Vol. 32, Australian National University, 1994.
[10] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company, Inc., USA, 1994.
[11] S. P. Timoshenko and J. N. Goodier, Theory of Elasticity, 3rd ed., McGraw-Hill, 1970.