1. Molecular dynamics (MD) simulations numerically integrate the equations of motion to generate long-time trajectories of molecular systems. However, the reachable timescale is limited by the small time step required for stable integration.
2. Parallelization using domain decomposition with MPI accelerates MD simulations by distributing the computational work across many processors. Key algorithms such as the fast Fourier transform (FFT), used in particle mesh Ewald (PME) methods for long-range interactions, must also be parallelized efficiently.
3. Two common parallelization schemes are the replicated-data approach and domain decomposition. Domain decomposition distributes spatial subdomains across processors and achieves better parallel efficiency, but is more complex to implement. Hybrid parallelization combining MPI and OpenMP can further improve performance.
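The domain decomposition of point 3 can be sketched as a spatial binning of particles onto a process grid. The following is an illustrative Python/NumPy sketch with made-up sizes, not GENESIS code; a real MD code would additionally exchange halo (boundary) particles with neighboring ranks every step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 particles in a cubic box, decomposed onto a
# 2x2x2 process grid (8 "ranks"). Sizes are illustrative only.
box = 10.0                       # cubic box edge length
grid = np.array([2, 2, 2])       # process grid dimensions
coords = rng.uniform(0.0, box, size=(1000, 3))

# Each particle belongs to the subdomain (rank) whose region contains it.
cell = np.floor(coords / (box / grid)).astype(int)   # (N, 3) subdomain indices
owner = np.ravel_multi_index(cell.T, grid)           # flat rank id per particle

# Per-rank particle lists: each rank computes forces only for its own
# particles, so work and memory per rank shrink as ranks are added.
per_rank = [np.where(owner == r)[0] for r in range(grid.prod())]
```

Because neighboring subdomains are adjacent in space, each rank only needs to communicate with a small, fixed set of neighbors regardless of the total rank count, which is why this scheme scales better than replicating all data everywhere.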
Molecular dynamics (MD) is a very useful tool for understanding various phenomena in atomistic detail. In MD, we can overcome the size- and time-scale problems through efficient parallelization. In this lecture, I'll explain various parallelization methods for MD, with examples from the optimization of the GENESIS MD software on Fugaku.
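The time-scale problem mentioned above stems from the small integration step. A minimal velocity-Verlet loop for a 1D harmonic oscillator (an illustrative Python sketch of the standard MD integrator, not GENESIS code) shows the per-step structure that must be repeated billions of times:

```python
import numpy as np

# Velocity-Verlet integration of a 1D harmonic oscillator (k = m = 1).
# The small time step dt needed for stability is what limits the
# timescales MD can reach without parallelization.
def velocity_verlet(x0, v0, dt, n_steps, force=lambda x: -x):
    x, v = x0, v0
    f = force(x)
    traj = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f      # half-step velocity update
        x = x + dt * v_half            # full-step position update
        f = force(x)                   # recompute force at new position
        v = v_half + 0.5 * dt * f      # second half-step velocity update
        traj.append(x)
    return np.array(traj), v

traj, v_end = velocity_verlet(x0=1.0, v0=0.0, dt=0.01, n_steps=10_000)
# Total energy 0.5*v^2 + 0.5*x^2 should stay near its initial value 0.5.
energy = 0.5 * v_end**2 + 0.5 * traj[-1]**2
```

Because velocity Verlet is symplectic, the total energy stays bounded near its initial value over long runs; the step size still cannot simply be enlarged, since the error grows with dt², which is why parallelism rather than a bigger time step is the practical route to longer trajectories.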
Parallelization of molecular dynamics simulations allows longer timescales to be reached. There are two main approaches: replicated data and domain decomposition. Domain decomposition divides the simulation space into subdomains and assigns particles to processors so as to minimize communication costs, giving better parallel efficiency than the replicated-data approach; popular MD codes such as GROMACS and NAMD use it. Efficient parallelization of the reciprocal-space calculation in particle-mesh Ewald methods is also important for performance: two-dimensional (pencil) decomposition of the 3D fast Fourier transform scales to more processors than one-dimensional (slab) decomposition.
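The FFT-decomposition trade-off can be illustrated with plain NumPy (a sketch of the underlying factorization, not a distributed implementation): a 3D FFT factors into 1D FFTs along each axis, with a data transpose between stages in the distributed case. A slab (1D) decomposition splits only one axis, so it can use at most N ranks for an N³ grid, while a pencil (2D) decomposition splits two axes and can use up to N² ranks.

```python
import numpy as np

# A 3D FFT is exactly the composition of 1D FFTs along each axis.
# Distributed PME codes exploit this: each rank transforms its local
# slabs or pencils, and an all-to-all transpose runs between stages.
rng = np.random.default_rng(1)
n = 16                               # small illustrative grid edge
data = rng.standard_normal((n, n, n))

step = np.fft.fft(data, axis=0)      # each rank transforms its local data
step = np.fft.fft(step, axis=1)      # (transpose happens here in parallel)
step = np.fft.fft(step, axis=2)      # (...and here)

# The composed per-axis result matches a direct 3D FFT.
max_err = np.max(np.abs(step - np.fft.fftn(data)))

# Rank limits for an n^3 grid: slab splits one axis, pencil splits two.
slab_max_ranks = n                   # 16
pencil_max_ranks = n * n             # 256
```

For a realistic PME grid (e.g. 128³), the slab limit of 128 ranks is quickly exhausted on a machine like Fugaku, while a pencil scheme can keep thousands of ranks busy, at the cost of one extra transpose per transform.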
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs[5] allows to obtain very good performance in comparison with the
staticmesh method used in literature. Several simulations are shown in order to evaluate the algorithm.
Multi carrier equalization by restoration of redundanc y (merry) for adaptive...IJNSA Journal
This paper proposes a new blind adaptive channel shortening approach for multi-carrier systems. The
performance of the discrete Fourier transform-DMT (DFT-DMT) system is investigated with the proposed
DST-DMT system over the standard carrier serving area (CSA) loop1. Enhanced bit rates demonstrated
and less complexity also involved by the simulation of the DST-DMT system.
The document presents a new approach called FPERT (Fuzzy PERT) for project network analysis that accounts for uncertainty in activity times. It begins with an overview of FPERT and its advantages over conventional PERT. It then discusses key concepts needed for FPERT like fuzzy sets, membership functions, and α-cuts. The document outlines the steps of the proposed FPERT method and provides an example calculation. It concludes by introducing notation that will be used to calculate earliest start, earliest finish, latest start and latest finish times for activities.
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd Iaetsd
This document discusses using the CORDIC algorithm to implement a pipelined FFT for fingerprint recognition on an FPGA. It proposes a hardware-efficient CORDIC FFT architecture that minimizes computational complexity. The CORDIC algorithm replaces complex multipliers with shift-add operations, providing a simpler hardware implementation than traditional multiplier-based FFT designs. The architecture includes a butterfly structure implemented with CORDIC, an angle generator for twiddle factors, and input/output blocks with registers and multiplexers to enable pipelined processing.
The document summarizes a presentation on minimizing tensor estimation error using alternating minimization. It begins with an introduction to tensor decompositions including CP, Tucker, and tensor train decompositions. It then discusses nonparametric tensor estimation using an alternating minimization method. The method iteratively updates components while holding other components fixed, achieving efficient computation. The analysis shows that after t iterations, the estimation error is bounded by the sum of a statistical error term and an optimization error term decaying exponentially in t. Real data analysis uses the method for multitask learning.
29. September bis 4. Oktober 2013, Dagstuhl Seminar 13401, Automatic Application Tuning for HPC Architectures, Session: infrastructures, 10:30-11:00, October 1st (TUE) , 2013.
Ml srhwt-machine-learning-based-superlative-rapid-haar-wavelet-transformation...Jumlesha Shaik
Abstract: In this paper a digital image coding technique called ML-SRHWT (Machine Learning based image coding by Superlative Rapid HAAR Wavelet Transform) has been introduced. Compression of digital image is done using the model Superlative Rapid HAAR Wavelet Transform (SRHWT). The Least Square Support vector Machine regression predicts hyper coefficients obtained by using QPSO model. The mathematical models are discussed in brief in this paper are SRHWT, which results in good performance and reduces the complexity compared to FHAAR and EQPSO by replacing the least good particle with the new best obtained particle in QPSO. On comparing the ML-SRHWT with JPEG and JPEG2000 standards, the former is considered to be the better.
Huge-Scale Molecular Dynamics Simulation of Multi-bubble NucleiHiroshi Watanabe
This document summarizes a molecular dynamics simulation of multi-bubble nuclei using the K computer supercomputer. The simulation directly observed the interaction and Ostwald ripening of billions of bubbles without hierarchical modeling assumptions. Analysis of bubble size distributions and scaling behaviors over time validated predictions of Lifshitz-Slyozov-Wagner theory, demonstrating the simulation captured relevant multi-scale and multi-physics phenomena. The huge-scale simulation was necessary to study bubble population dynamics and obtain reliable statistical results.
This document summarizes a research paper that proposes using the firefly algorithm to solve the NP-hard optimization problem of assigning cells in a cellular network to switches in a way that minimizes total costs. The total cost includes handoff costs for calls between adjacent cells and cabling costs for connecting cells to switches, subject to constraints like switch capacity. It describes modeling the problem mathematically and comparing the firefly algorithm to the particle swarm optimization (PSO) algorithm. The firefly algorithm is shown to solve the cell-to-switch assignment problem faster than existing approaches like PSO.
Improving The Performance of Viterbi Decoder using Window System IJECEIAES
An efficient Viterbi decoder is introduced in this paper; it is called Viterbi decoder with window system. The simulation results, over Gaussian channels, are performed from rate 1/2, 1/3 and 2/3 joined to TCM encoder with memory in order of 2, 3. These results show that the proposed scheme outperforms the classical Viterbi by a gain of 1 dB. On the other hand, we propose a function called RSCPOLY2TRELLIS, for recursive systematic convolutional (RSC) encoder which creates the trellis structure of a recursive systematic convolutional encoder from the matrix “H”. Moreover, we present a comparison between the decoding algorithms of the TCM encoder like Viterbi soft and hard, and the variants of the MAP decoder known as BCJR or forward-backward algorithm which is very performant in decoding TCM, but depends on the size of the code, the memory, and the CPU requirements of the application.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document provides the solutions to selected problems from the textbook "Introduction to Parallel Computing". The solutions are supplemented with figures where needed. Figure and equation numbers are represented in roman numerals to differentiate them from the textbook. The document contains solutions to problems from 13 chapters of the textbook covering topics in parallel computing models, algorithms, and applications.
A Single-Phase Clock Multiband Low-Power Flexible Dividerijsrd.com
In this paper, a low-power single-phase clock multiband flexible divider for Bluetooth, Zigbee, and IEEE 802.15.4 and 802.11 a/b/g WLAN frequency synthesizers The frequency synthesizer was implemented using a charge-pump based phase-locked loop with a tri-state phase/frequency detector and a programmable pulse-swallow frequency divider. Since the required frequency of operation can be as high as 1.4GHz, the speed of the digital logic used in the frequency divider is a critical design factor. A custom library of digital logic gates was designed using MOS current-mode logic (MCML). These gates were designed to operate at frequencies up to 1.4GHz. This report outlines the design of the phase/frequency detector and the programmable pulse-swallow frequency divider. The design, layout, and simulation of the MCML logic family are also presented.
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...IJRESJOURNAL
With the development of productivity and the fast growth of the economy, environmental pollution, resource utilization and low product recovery rate have emerged subsequently, so more and more attention has been paid to the recycling and reuse of products. However, since the complexity of disassembly line balancing problem (DLBP) increases with the number of parts in the product, finding the optimal balance is computationally intensive. In order to improve the computational ability of particle swarm optimization (PSO) algorithm in solving DLBP, this paper proposed an improved adaptive multi-objective particle swarm optimization (IAMOPSO) algorithm. Firstly, the evolution factor parameter is introduced to judge the state of evolution using the idea of fuzzy classification and then the feedback information from evolutionary environment is served in adjusting inertia weight, acceleration coefficients dynamically. Finally, a dimensional learning strategy based on information entropy is used in which each learning object is uncertain. The results from testing in using series of instances with different size verify the effect of proposed algorithm.
I studied in Indian Institute of Technology, Kharagpur, India. I did my B.Texh and M.Tech in the department of Electronics and Electrical Communication Engineering. I was student of 2018 batch. After that, I joined Schneider Electric Systems India Private limited Company as Software design Engineer. Currently I am designated as Senior Firmware Engineer in the same company. I have work experience of 4+ years. The uploaded ppt is my MTP Thesis. It is about "temperature aware application mapping on to mesh based network on chip using Genetic Algorithm".
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...IOSRJVSP
This document presents a neural network approach to channel equalization using a multilayer perceptron with a variable learning rate parameter. Specifically, it proposes modifying the backpropagation algorithm to allow the learning rate to adapt at each iteration in order to achieve faster convergence. The equalizer structure is a decision feedback equalizer modeled as a neural network with an input, hidden and output layer. Simulation results show the proposed variable learning rate approach improves bit error rate and convergence speed compared to a standard backpropagation algorithm.
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
See our presentation from the 6th International EULAG Users Workshop. We talked about taking HPC to the "Industry 4.0" by implementing smart techniques to optimize the codes in terms of performance and energy consumption. It explains how Machine Learning can dynamically optimize HPC simulations and byteLAKE's software autotuning solution.
Find out more about byteLAKE at: www.byteLAKE.com
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE...Andrea Tassi
In this paper, we propose a novel advanced multi-rate design for evolved Multimedia Multicast/Broadcast Service (eMBMS) in fourth generation (4G) Long-Term Evolution (LTE)/LTE-Advanced (LTE-A) networks. The proposed design provides: i) reliability, based on random network coded (RNC) transmission, and ii) efficiency, obtained by optimized rate allocation across multi-rate RNC streams. The paper provides an in-depth description of the system realization and demonstrates the feasibility of the proposed eMBMS design using both analytical and simulation results. The system performance is compared with popular multi-rate multicast approaches in a realistic simulated LTE/LTE-A environment.
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
The document discusses efficient VLSI implementations of image encryption using minimal operations. It proposes using discrete cosine transform (DCT) for image compression and encryption simultaneously. For encryption, a linear feedback shift register generates random numbers added to some DCT outputs. The DCT algorithm and arithmetic operators are optimized to reduce operations and increase throughput. Simulation results show encryption in the frequency domain at 656 million samples per second on an 82 MHz clock.
Efficient methods for accurately calculating thermoelectric properties – elec...Anubhav Jain
1) AMSET is a new method for efficiently calculating electronic transport properties from first principles that provides accurate results comparable to more computationally expensive methods.
2) HiPhive uses a data fitting approach to extract interatomic force constants from a small number of non-systematic displacement calculations, avoiding the need for many systematic calculations required by traditional methods to obtain phonon and thermal properties.
3) These new efficient methods enable high-throughput screening of thermoelectric materials by providing accurate transport properties while being computationally feasible for large numbers of materials.
This document summarizes basic communication operations for parallel computing including:
- One-to-all broadcast and all-to-one reduction which involve sending a message from one processor to all others or combining messages from all processors to one.
- All-to-all broadcast and reduction where all processors simultaneously broadcast or reduce messages.
- Collective operations like all-reduce and prefix-sum which combine messages from all processors using associative operators.
- Examples of implementing these operations on different network topologies like rings, meshes and hypercubes are presented along with analyzing their communication costs. The document provides an overview of fundamental communication patterns in parallel computing.
The MSc defense ceremony was held on 6-7-2017 in Mansoura University, Faculty of Engineering. This presentation is shared to help MSc students in Faculty of Engineering prepare their thesis presentation and ease their tension before their presentation time
Design and Analysis of Power and Variability Aware Digital Summing CircuitIDES Editor
This document analyzes and compares different 1-bit digital summing circuit topologies in terms of their robustness against process, voltage, and temperature variations at the 22nm technology node. It finds that the transmission gate-based topology is the most robust, with the tightest spread in propagation delay, power dissipation, and energy-delay product. It then proposes a transmission gate-based digital summing circuit implemented using carbon nanotube field-effect transistors, which offers even greater robustness against PVT variations compared to an implementation using traditional MOSFETs.
This document discusses using a C8051 MCU to calculate FFTs in real-time for audio spectrum analysis. It introduces:
(1) Key techniques used in the FFT software to optimize efficiency and accuracy, including avoiding multiplications, 16-bit integer storage, and using the MCU's on-chip PLL.
(2) The FFT algorithm and techniques like bit-reversal, windowing and anti-aliasing filtering.
(3) Details of the software design and implementation, including interrupt service routines to read ADC samples, windowing functions, bit-reversal indexing, and FFT computation loops.
(4) Experimental results analyzing the performance under different configurations and analyzing properties like
This document proposes a 3T XOR gate design using only pMOS transistors. It has lower power-delay product (PDP) than existing 3T XOR gate designs. As an application, a 2x2 array multiplier is designed using the proposed XOR gate. Simulations on a 45nm process show the multiplier using the proposed gate has lower PDP than the existing design, especially at higher voltages and temperatures. The proposed design also has better noise immunity. In summary, the document introduces a new low-power 3T pMOS XOR gate and demonstrates its advantages in a 2x2 array multiplier application through simulations.
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...IJTET Journal
An efficient architecture for the implementation of a delayed least mean square adaptive filter. A Novel
partial product Generator is achieving lower adaptation-delay and Area delay consumption and propose a strategy
for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis
results, the proposed design will offers less area-delay product (ADP) the best of the existing systolic structures, on
average, for filter lengths N =8, 16, and 32. An efficient fixed-point implementation scheme of the proposed
architecture, The EEG(electroencephalogram) is used for recording of electrical activity of the brain .During
recording the EEG is contaminated by various artifacts as PLI(Power line interference), MA(Muscle artifact),
EBA(Eye blink artifact). This paper gives Detail of various artifacts which occur in EEG signal. In this we study
adaptive filter for reducing the EBA (eye blink artifact) noise from the EEG signal and to increase SNR (Signal to
noise ratio).the analytical result matches with the simulation result is showed.
Ensemble Empirical Mode Decomposition: An adaptive method for noise reductionIOSR Journals
Abstract:Empirical mode decomposition (EMD), a data analysis technique, is used to denoise non-stationary and non-linear processes. The method does not require any pre & post processing of signal and use of any specified basis functions. But EMD suffers from a problem called mode mixing. So to overcome this problem a new method known as Ensemble Empirical mode decomposition (EEMD) has been introduced. The presented paper gives the detail of EEMD and its application in various fields. EEMD is a time–space analysis method, in which the added white noise is averaged out with sufficient number of trials; and the averaging process results in only the component of the signal (original data). EEMD is a truly noise-assisted data analysis (NADA) method and represents a substantial improvement over the original EMD. Keywords –Data analysis, Empirical mode decomposition, intrinsic mode function, mode mixing, NADA,
Similar to El text.tokuron a(2019).jung190711 (20)
Realization of Innovative Light Energy Conversion Materials utilizing the Sup...RCCSRENKEI
The document is very short and does not provide much context. It contains only one word: "Real". From this limited information, it is difficult to construct an informative summary in 3 sentences or less.
Current status of the project "Toward a unified view of the universe: from la...RCCSRENKEI
This document summarizes the current status of a large, multi-institutional project in Japan aimed at developing a unified understanding of structure formation in the universe through multi-level simulations and observations. The project involves over 90 researchers across 21 institutions. It is divided into four sub-projects focusing on: large-scale structures and galaxy formation (Sub A); molecular clouds and planetary formation (Sub B); black holes, supernovae, and radiation transport (Sub C); and the solar system, Venus, and gas giant planets (Sub D). Several key simulations have been performed achieving unprecedented resolution, including galaxy formation at the star-by-star level, globular cluster dynamics, and a 12.8 billion point simulation of solar conve
Fugaku, the Successes and the Lessons LearnedRCCSRENKEI
The document summarizes the successes and lessons learned from Fugaku, Japan's flagship supercomputer. Key points include:
- Fugaku achieved the top performance on all HPC benchmarks in 2020 and 2021, showing high performance across applications, not just traditional HPC workloads.
- While many applications achieved their target performance, some did not due to issues like insufficient parallelism, I/O scalability problems, and compiler vectorization failures.
- Lessons include the need for improved software stacks, application analysis, and adapting to modern applications beyond classic HPC.
- Looking ahead, sustained exascale performance will require data-centric architectures and corresponding system software and algorithms as transistor scaling slow
3. Molecular Dynamics (MD)
1. Energy and forces are described by a classical molecular mechanics force field.
2. The state is updated according to the equations of motion (integration).
Long-time MD trajectories are important to obtain thermodynamic quantities of the target system:
equation of motion → integration → long-time MD trajectory => ensemble generation
4. Potential energy in MD

The total potential energy is a sum of bonded and non-bonded terms:

E_total = \sum_{bonds} k_b (b - b_0)^2                                        ... O(N)
        + \sum_{angles} k_\theta (\theta - \theta_0)^2                        ... O(N)
        + \sum_{dihedrals} \frac{V_n}{2} [1 + \cos(n\phi - \delta)]           ... O(N)
        + \sum_{i<j} \left\{ \epsilon_{ij} \left[ \left(\frac{r_{ij}^0}{r_{ij}}\right)^{12}
          - 2 \left(\frac{r_{ij}^0}{r_{ij}}\right)^{6} \right]
          + \frac{q_i q_j}{\epsilon r_{ij}} \right\}                          ... O(N^2)

The O(N^2) non-bonded double sum is the main bottleneck in MD.
With Ewald splitting, the electrostatic part is divided into a short-range real-space sum (evaluated together with the Lennard-Jones 12-6 term) and a reciprocal-space sum evaluated with FFT:

E_{elec} = \sum_{i<j} \frac{q_i q_j \, \mathrm{erfc}(\alpha r_{ij})}{r_{ij}}
         + \frac{1}{2\pi V} \sum_{\mathbf{k}\neq 0}
           \frac{\exp(-\mathbf{k}^2/4\alpha^2)}{\mathbf{k}^2} \, |\mathrm{FFT}(Q)(\mathbf{k})|^2

Real space: O(CN)        Reciprocal space: O(N log N)
(N: total number of particles, C: average number of neighbors within the cutoff)
5. Non-bonded interaction
1. The non-bonded energy calculation is reduced by introducing a cutoff.
2. The electrostatic energy beyond the cutoff is evaluated in reciprocal space with FFT.
3. The cost can be reduced further by properly distributing the work over parallel processes, in particular with a good domain decomposition scheme.

E = real part + reciprocal part + self-energy correction
(The direct pair sum is O(N^2); with the cutoff, the real part becomes O(N).)
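As an illustration of the real-space part described above, here is a minimal Python sketch (function and variable names are mine; a naive O(N^2) pair loop stands in for the cell/pair-list machinery a real MD code would use):

```python
import math

def real_space_energy(coords, charges, cutoff, alpha):
    """Real-space electrostatic energy with a cutoff and erfc damping.

    coords:  list of (x, y, z) tuples; charges: list of floats.
    Pairs beyond the cutoff are skipped; their electrostatic
    contribution is handled in reciprocal space with FFT.
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            dz = coords[i][2] - coords[j][2]
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            if r >= cutoff:
                continue  # beyond the cutoff: reciprocal-space part
            energy += charges[i] * charges[j] * math.erfc(alpha * r) / r
    return energy
```

A production code visits only the O(CN) pairs inside the cutoff via neighbor lists instead of testing all N(N-1)/2 pairs.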
6. Difficulty of performing long-time MD simulations
1. The time step length (Δt) is limited to 1-2 fs due to fast vibrations.
2. On the other hand, biologically meaningful events occur on the time scale of milliseconds or longer.
(Figure: time-scale axis from fs to sec; vibrations at fs, sidechain motions at ps, mainchain motions at ns, protein global motions at μs, folding at ms and beyond.)
7. How to accelerate MD simulations? => Parallelization
(Figure: serial run on 1 CPU vs. parallel run on 16 CPUs, ideally 16x faster.)
Good parallelization requires:
1) a small amount of computation in each process
2) a small communication cost
(Figure: hardware hierarchy, a CPU consisting of multiple cores, with MPI communication between processes.)
9. Shared memory parallelization (OpenMP)
(Figure: processes P1...PP all attached to one shared memory.)
• All threads share the data in memory.
• For efficient parallelization, threads should not write to the same memory address.
• It is only available among the cores within a single physical node.
10. Distributed memory parallelization (MPI)
(Figure: processes P1...PP, each with its own memory M1...MP.)
• Processes do not share data in memory.
• Data must be sent/received via communications.
• For efficient parallelization, the amount of communicated data should be minimized.
12. SIMD (Single instruction, multiple data)
• The same operation is applied to multiple data points simultaneously.
• Usually applicable to regular tasks such as adjusting graphics images or audio volume.
• In most MD programs, SIMD is one of the important topics for increasing performance.
(Figure: with 4-wide SIMD, four elements are processed per instruction, 4 times faster than scalar code.)
13. SIMT (Single instruction, multiple threads)
• SIMD combined with multithreading.
• The SIMT execution model is usually implemented on GPUs and is related to GPGPU (General-Purpose computing on Graphics Processing Units).
• Currently, CUDA executes SIMT over groups of 32 threads (warp size = 32).
(Figure: a CUDA grid is composed of blocks, e.g. Block(0,0)...Block(1,1), and each block is composed of threads, e.g. Thread(0,0)...Thread(3,2).)
15. Parallelization scheme 1: Replicated data approach
1. Each process has a copy of all particle data.
2. Each process performs only part of the whole work by proper index assignment in the do loops.

Serial:
do i = 1, N
  do j = i+1, N
    energy(i,j)
    force(i,j)
  end do
end do

Parallel:
my_rank = MPI rank
nproc   = total number of MPI processes
do i = my_rank+1, N, nproc
  do j = i+1, N
    energy(i,j)
    force(i,j)
  end do
end do
MPI reduction (energy, force)
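The strided loop assignment above can be checked in a few lines of Python (a serial sketch with names of my choosing; the MPI reduction is emulated by concatenating each rank's pair list):

```python
def pairs_for_rank(n, my_rank, nproc):
    """Pairs (i, j) with i < j handled by one rank under the strided
    i-loop assignment i = my_rank, my_rank + nproc, ...
    (0-based here; the slide's Fortran loop is 1-based)."""
    pairs = []
    for i in range(my_rank, n, nproc):
        for j in range(i + 1, n):
            pairs.append((i, j))
    return pairs

# Emulating the reduction over 4 ranks: every pair appears exactly once.
all_pairs = []
for rank in range(4):
    all_pairs.extend(pairs_for_rank(8, rank, 4))
```

Note the load imbalance inherent in this assignment: with n = 8 and 4 ranks, rank 0 handles 10 pairs while rank 3 handles only 4, since small i values have the longest inner j loops. This is the imbalance illustrated on the next slide.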
16. Parallelization scheme 1: Replicated data approach (continued)
(Figure: computational cost versus atom indices 1-8, with the strided assignment of indices to MPI ranks 0-3.)
Perfect load balance is not guaranteed in this parallelization scheme.
17. Hybrid (MPI+OpenMP) parallelization of the replicated data approach
1. Work is distributed over MPI processes and OpenMP threads.
2. Parallel efficiency is increased by reducing the number of MPI processes involved in communications.

MPI only:
my_rank = MPI rank
nproc   = total number of MPI processes
do i = my_rank+1, N, nproc
  do j = i+1, N
    energy(i,j)
    force(i,j)
  end do
end do
MPI reduction (energy, force)

MPI + OpenMP:
my_rank = MPI rank
nproc   = total number of MPI processes
nthread = total number of OpenMP threads
!$omp parallel
id    = OpenMP thread id
my_id = my_rank*nthread + id
do i = my_id+1, N, nproc*nthread
  do j = i+1, N
    energy(i,j)
    force(i,j)
  end do
end do
OpenMP reduction
!$omp end parallel
MPI reduction (energy, force)
18. Pros and cons of the replicated data approach
1. Pros: easy to implement
2. Cons
• Parallel efficiency is not good:
  - perfect load balance is not guaranteed
  - the communication cost is not reduced by increasing the number of processes
  - only the energy calculation is parallelized (with MPI, parallelization of the integration is not very efficient)
• Needs a lot of memory because of the use of global (replicated) data
19. Parallelization scheme 2: Domain decomposition
1. The simulation space is divided into subdomains according to the number of MPI processes (in the figure, different colors for different processes).
2. Each MPI process only considers the corresponding subdomain.
3. MPI communications occur only among neighboring processes.
(Figure: communications between processes across subdomain boundaries.)
20. Parallelization scheme 2: Domain decomposition (continued)
1. For interactions between different subdomains, each subdomain needs the data of a buffer region around it.
2. The size of the buffer region depends on the cutoff value.
3. Interactions between particles in different subdomains must be treated very carefully.
4. The sizes of the subdomain and the buffer region decrease as the number of processes increases.
(Figure: a subdomain surrounded by a buffer region whose width is the cutoff distance c.)
21. Pros and cons of the domain decomposition approach
1. Pros
• Good parallel efficiency
• The computational cost per process is reduced by increasing the number of processes
• We can easily parallelize not only the energy calculation but also the integration
• Applicable to huge systems
• The data size per process is decreased by increasing the number of processes
2. Cons
• Implementation is not easy
• The domain decomposition scheme depends highly on the potential energy type, the cutoff, and so on
• Good performance cannot be obtained for nonuniform particle distributions
22. Special treatment for nonuniform distributions
Tree method: the subdomain size is adjusted so that each subdomain has the same number of particles.
Hilbert space-filling curve: a map that relates multi-dimensional space to a one-dimensional curve.
The figure is from Wikipedia: https://en.wikipedia.org/wiki/Hilbert_curve
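As a sketch of how a Hilbert curve linearizes a grid, the standard iterative distance-to-coordinate conversion (following the well-known algorithm described on the Wikipedia page cited above) can be written as:

```python
def hilbert_d2xy(n, d):
    """Map distance d along the Hilbert curve to (x, y) on an n x n grid
    (n a power of two). Consecutive d values land on adjacent cells, so
    cutting the curve into equal-length pieces yields compact subdomains
    with roughly equal particle counts for load balancing."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate/flip the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Because neighboring points on the curve are neighboring cells in space, assigning contiguous curve segments to processes keeps each process's cells spatially clustered, which is what minimizes communication for nonuniform particle distributions.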
23. Comparison of the two parallelization schemes

                       Computation   Communication    Memory
Replicated data        O(N/P)        O(N)             O(N)
Domain decomposition   O(N/P)        O((N/P)^(2/3))   O(N/P)

N: system size, P: number of processes
25. Smooth particle mesh Ewald method

E = \frac{1}{2} \sum_{\mathbf{n}} \sum_{i,j}{}' \frac{q_i q_j \,
    \mathrm{erfc}(\alpha |\mathbf{r}_{ij} + \mathbf{n}|)}{|\mathbf{r}_{ij} + \mathbf{n}|}
  + \frac{1}{2\pi V} \sum_{\mathbf{k}\neq 0}
    \frac{\exp(-\mathbf{k}^2/4\alpha^2)}{\mathbf{k}^2} S(\mathbf{k}) S(-\mathbf{k})
  - \frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2

  (real part)                (reciprocal part)                (self energy)

The structure factor S in the reciprocal part is approximated, using cardinal B-splines of order n, by the Fourier transform of the charge grid Q:

S(k_1, k_2, k_3) \approx b_1(k_1)\, b_2(k_2)\, b_3(k_3)\, F(Q)(k_1, k_2, k_3)

It is important to parallelize the fast Fourier transform efficiently in PME!
Ref: U. Essmann et al., J. Chem. Phys. 103, 8577 (1995)
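The cardinal B-splines used for the charge interpolation obey a simple two-term recursion; a minimal Python sketch (function name is mine, not from any PME code):

```python
def cardinal_bspline(n, u):
    """Cardinal B-spline M_n(u) with support [0, n], built from the
    recursion  M_n(u) = u/(n-1) M_{n-1}(u) + (n-u)/(n-1) M_{n-1}(u-1),
    starting from the triangle function M_2(u) = 1 - |u - 1|."""
    if u < 0.0 or u > n:
        return 0.0
    if n == 2:
        return 1.0 - abs(u - 1.0)
    return (u / (n - 1)) * cardinal_bspline(n - 1, u) \
        + ((n - u) / (n - 1)) * cardinal_bspline(n - 1, u - 1)
```

Two properties matter for PME: the integer translates of M_n sum to 1 (partition of unity), so spreading each charge over n neighboring grid points conserves total charge, and M_n is (n-2)-times continuously differentiable, which is what makes the forces smooth.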
26. Overall procedure of the reciprocal space calculation

PME_Pre (charge in real space → charge Q on the grid)
  → FFT: F(Q)
  → energy evaluation E(k) in reciprocal space
  → inverse FFT
  → PME_Post (force calculation from the grid)
28. Parallel 3D FFT – slab (1D) decomposition
1. Each process is assigned a slab of size Nx × Ny × (Nz/P) for computing the FFT of Nx × Ny × Nz grids on P processes.
2. The scalability is limited by P ≤ Nz.
3. Nz should be divisible by P.
(Figure: the grid divided into slabs assigned to Proc 1, 2, 3, 4.)
29. Parallel 3D FFT – slab (1D) decomposition
(Figure: the 2D FFT within each slab needs no communication; the subsequent transpose requires MPI_Alltoall communication among all processes.)
30. Parallel 3D FFT – slab (1D) decomposition (continued)
1. Slab decomposition of a 3D FFT has three steps:
• 2D FFT (or two 1D FFTs) along the two local dimensions
• global transpose (communication)
• 1D FFT along the third dimension
2. Pros
• fast when using a small number of processes
3. Cons
• limitation of the number of processes
31. Parallel 3D FFT – 2D decomposition
Each process is assigned data of size Nx × (Ny/Py) × (Nz/Pz) for computing an FFT of Nx × Ny × Nz grids on P = Py × Pz processes.
(Figure: at each transpose step, MPI_Alltoall communications are needed only among the 3 processes of the same color, i.e. within one row or column of the 2D process grid.)
32. Parallel 3D FFT – 2D decomposition (continued)
1. 2D decomposition of a 3D FFT has five steps:
• 1D FFT along the local dimension
• global transpose
• 1D FFT along the second dimension
• global transpose
• 1D FFT along the third dimension
2. Each global transpose requires communication only within a subgroup of all nodes.
3. Cons: slower than 1D decomposition for a small number of processes
4. Pros: maximum parallelization is increased
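The decompositions above all rely on the fact that a 3D DFT factorizes into 1D transforms along each axis, with transposes in between in the parallel case. A serial Python sketch of this structure (naive O(n^2) DFTs on a tiny grid, purely illustrative; names are mine):

```python
import cmath

def dft1d(a):
    """Naive 1D DFT, standing in for a real FFT routine."""
    n = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def fft3d(grid):
    """3D DFT of grid[z][y][x] as three passes of 1D transforms.
    In the 2D-decomposed parallel version, a global transpose
    (MPI_Alltoall) is inserted between consecutive passes."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # pass 1: transform along x (the local dimension)
    g = [[dft1d(grid[z][y]) for y in range(ny)] for z in range(nz)]
    # pass 2: transform along y
    for z in range(nz):
        for x in range(nx):
            col = dft1d([g[z][y][x] for y in range(ny)])
            for y in range(ny):
                g[z][y][x] = col[y]
    # pass 3: transform along z
    for y in range(ny):
        for x in range(nx):
            col = dft1d([g[z][y][x] for z in range(nz)])
            for z in range(nz):
                g[z][y][x] = col[z]
    return g
```

Each pass only needs one axis to be locally contiguous, which is exactly why the 2D decomposition can keep two axes distributed and still compute the full 3D transform with two transposes.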
33. Pseudo code of reciprocal space calculation with 2D decomposition of 3D FFT

! compute Q factor (charge grid)
do i = 1, natom/P
  compute Q_orig
end do
call mpi_alltoall(Q_orig, Q_new, ...)
accumulate Q from Q_new

! forward FFT : F(Q)
do iz = 1, zgrid(local)            ! 1D FFT along x
  do iy = 1, ygrid(local)
    work_local = Q(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do
call mpi_alltoall(Q, Q_new, ...)   ! global transpose
do iz = 1, zgrid(local)            ! 1D FFT along y
  do ix = 1, xgrid(local)
    work_local = Q_new(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do
call mpi_alltoall(Q, Q_new, ...)   ! global transpose
do iy = 1, ygrid(local)            ! 1D FFT along z
  do ix = 1, xgrid(local)
    work_local = Q_new(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do

! compute energy and virial
do iz = 1, zgrid(local)
  do iy = 1, ygrid(local)
    do ix = 1, xgrid(local)
      energy = energy + sum(Th*Q)
      virial = virial + ...
    end do
  end do
end do

! X = F_1(Th)*F_1(Q) : inverse FFT, F(X)
do iy = 1, ygrid(local)            ! 1D FFT along z
  do ix = 1, xgrid(local)
    work_local = Q(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do
call mpi_alltoall(Q, Q_new, ...)   ! global transpose
do iz = 1, zgrid(local)            ! 1D FFT along y
  do ix = 1, xgrid(local)
    work_local = Q_new(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do
call mpi_alltoall(Q, Q_new, ...)   ! global transpose
do iz = 1, zgrid(local)            ! 1D FFT along x
  do iy = 1, ygrid(local)
    work_local = Q_new(my_rank)
    call fft(work_local)
    Q(my_rank) = work_local
  end do
end do

! compute force
34. Parallel 3D FFT – 3D decomposition
• More communication steps than the existing FFT decompositions
• MPI_Alltoall communications only within one-dimensional process groups
• Reduces the communication cost for a large number of processes
Ref: J. Jung et al., Comput. Phys. Comm. 200, 57-65 (2016)
36. Domain decomposition in GENESIS SPDYN
1. The simulation space is divided into subdomains according to the number of MPI processes.
2. Each subdomain is further divided into unit domains (named cells).
3. Particle data are grouped together in each cell.
4. Communications are considered only between neighboring processes.
5. Within each subdomain, OpenMP parallelization is used.
(Figure: subdomains divided into cells, with the cell size set by the cutoff distance c.)
37. Efficient shared memory calculation => cell-wise particle data
1. Each cell contains an array with the data of the particles that reside within it.
2. This improves the locality of the particle data for operations on individual cells or pairs of cells.
(Figure: particle data in traditional cell lists vs. cell-wise arrays of particle data.)
Ref: P. Gonnet, JCC 33, 76-81 (2012)
38. Midpoint cell method
• For every cell pair, a midpoint cell is determined.
• If the midpoint cell is not uniquely determined, we choose it by considering load balance.
• For every particle pair, the subdomain that computes the interaction is determined from the midpoint cell.
• With this scheme, each subdomain only needs to communicate the data of the cells adjacent to it.
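A sketch of how a midpoint cell could be chosen for a cell pair (the tie-break here is a hypothetical deterministic rule; as stated above, GENESIS resolves non-unique midpoints by considering load balance instead):

```python
def midpoint_cell(c1, c2):
    """Midpoint cell of two cells given as integer index tuples.
    When the arithmetic midpoint is not on the integer grid (the sum
    of indices is odd), floor division rounds toward the lower index,
    which serves as a simple deterministic tie-break in this sketch."""
    return tuple((a + b) // 2 for a, b in zip(c1, c2))
```

The process owning the midpoint cell computes the interaction for that pair, which is why each subdomain only needs coordinate data from its adjacent cells: the midpoint of any interacting pair lies at most one cell away from both partners.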
39. Subdomain structure with the midpoint cell method
1. Each MPI process has a local subdomain which is surrounded by boundary cells.
2. The local subdomain has at least two cells in each direction.
(Figure: local subdomain surrounded by boundary cells.)
40. Communication pattern (example: coordinates)
1. Each subdomain can evaluate interactions by obtaining the data of its boundary region from other processes.
2. Using the communication pattern below, we can minimize the frequency of communication:
① receive from the upper process in the x dimension
② receive from the lower process in the x dimension
③ receive from the upper process in the y dimension
④ receive from the lower process in the y dimension
41. Communication pattern (example: forces)
1. The communication pattern of forces is the reverse of the communication pattern of coordinates.
2. Each process sends the force data in its boundary cells:
① send to the lower process in the y dimension
② send to the upper process in the y dimension
③ send to the lower process in the x dimension
④ send to the upper process in the x dimension
42. OpenMP parallelization in GENESIS
1. In the integrator, the cell indices are divided according to the thread id.
2. For the non-bonded interaction, cell-pair lists are first identified, and the cell pairs are then distributed to the threads.
(Figure: a 4x4 cell grid (cells 1-16); the cell-pair list (1,2), (1,5), (1,6), (2,3), (2,5), (2,6), (2,7), (3,4), (3,6), (3,7), (3,8), (4,7), (4,8), (5,6), (5,9), (5,10), (6,7), (6,10), ... is assigned round-robin to threads 1-4.)
44. FFT in GENESIS (2-dimensional view)
GENESIS: identical domain decomposition between real space and reciprocal space.
NAMD, GROMACS: different domain decompositions between the two spaces.
45. Parallelization based on CPU/GPU calculation
Computationally intensive part: GPU
Communication-intensive part: CPU
46. SIMD in GENESIS (developer version)
• Array of Structures (AoS): x1 y1 z1 x2 y2 z2 ... xN yN zN
  In GENESIS, it is expressed as coord_pbc(1:3,1:natom,1:ncell).
• Structure of Arrays (SoA): x1 x2 x3 x4 ... zN-2 zN-1 zN
  In the updated GENESIS source code for KNL, it is expressed as coord_pbc(1:natom,1:3,1:ncell).
• On Haswell/Broadwell/KNL machines, SoA shows better performance than AoS due to efficient vectorization (SIMD).
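The layout change can be mimicked in Python (a sketch; in the Fortran code it is simply a change of array index order, coord_pbc(1:3,1:natom,...) versus coord_pbc(1:natom,1:3,...)):

```python
# AoS: one record per atom, x/y/z interleaved in memory
aos = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]

def aos_to_soa(aos):
    """Transpose an AoS particle list into SoA component arrays.
    With SoA, a SIMD unit can load x1 x2 x3 x4 ... from contiguous
    memory, which is what enables efficient vectorization of the
    inner non-bonded loops."""
    xs, ys, zs = (list(c) for c in zip(*aos))
    return xs, ys, zs
```

With AoS, a vector load of four consecutive x coordinates would have to gather every third element, so the SoA ordering is the one that matches the hardware's vector registers.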
47. GENESIS developments on KNL
A simpler pair-list scheme with better performance:
• The pair-list scheme in GENESIS 1.X does not show good performance due to non-contiguous memory access (Figure (a)).
48. GENESIS developments on KNL

                    Memory usage   Number of      CPU time     CPU time       CPU time
                                   interactions   (pair-list)  (non-bonded)   (total)
Algorithm 1 (AoS)   305.88 MiB     47,364,770     9.95         36.42          46.37
Algorithm 1 (SoA)   305.88 MiB     47,364,770     6.60         33.71          40.31
Algorithm 2 (AoS)   48.4 MiB       122,404,436    4.91         102.46         107.37
Algorithm 2 (SoA)   48.4 MiB       122,404,436    4.15         27.86          32.01

• Algorithm 1: pair-list algorithm used in GENESIS 1.X
• Algorithm 2: new pair-list scheme
• Target system: ApoA1 (about 90,000 atoms, 2,000 time steps)