Published on

Organizers

Daniele Loiacono, Politecnico di Milano

Antonino Tumeo, Pacific Northwest National Laboratory

Webpage

http://gpu.geccocompetitions.com


- 1. GECCO 2013 GPUs for GEC. GPUs for Genetic and Evolutionary Computation Competition. Daniele Loiacono and Antonino Tumeo
- 2. Why GPUs?
  - The GPU has evolved into a very flexible and powerful processor:
    - It is programmable using high-level languages
    - It now supports 32-bit and 64-bit IEEE-754 floating-point precision
    - It offers lots of GFLOPS
  - There is a GPU in every PC and workstation
- 3. This competition…
  - Goal: attract applications of genetic and evolutionary computation that can maximally exploit the parallelism provided by low-cost consumer graphics cards.
  - Evaluation:
    - 50%: Quality and Performance
    - 30%: Relevance for the EC community
    - 20%: Novelty
  - Panel: Simon Harding, El-Ghazali Talbi, Antonino Tumeo, Jaume Bacardit … and myself
- 4. Entries
  - “Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment”, Shigeyoshi Tsutsui and Noriyuki Fujimoto
  - “GPOCL: A Massively Parallel GP Implementation in OpenCL”, Douglas A. Augusto and Helio J.C. Barbosa
- 5. GECCO Competitions: GPUs for Genetic and Evolutionary Computation, Evolutionary … Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment
- 6. Quadratic Assignment Problem (QAP)
  - One of the hardest combinatorial optimization problems
    - There are many real-world applications:
      - Optimum location allocation of factories in a multinational company
      - Optimum section allocation in a big building
      - …
  - Definition:
    - Given n locations and n facilities, the task is to assign the facilities to the locations so as to minimize the cost

      $$f(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, b_{\phi(i)\phi(j)}$$

      - $a_{ij}$ is the distance matrix for each pair of locations i and j
      - $b_{ij}$ is the flow matrix for each pair of facilities i and j
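The cost definition above can be sketched in a few lines of Python. This is only an illustration: the function name `qap_cost`, the list-of-lists matrices, and the 0-based indexing are assumptions made here, not something taken from the slides.

```python
def qap_cost(a, b, phi):
    """Cost f(phi) = sum_i sum_j a[i][j] * b[phi[i]][phi[j]].

    a   : n x n distance matrix between locations
    b   : n x n flow matrix between facilities
    phi : permutation, phi[i] is the facility placed at location i (0-based)
    """
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]]
               for i in range(n) for j in range(n))


# Tiny hand-made instance (hypothetical data, for illustration only).
a = [[0, 2, 3], [2, 0, 1], [3, 1, 0]]
b = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
print(qap_cost(a, b, [0, 1, 2]))  # identity assignment
print(qap_cost(a, b, [1, 0, 2]))  # facilities 0 and 1 exchanged
```

Evaluating the cost from scratch like this takes O(n²) operations per assignment, which is why the later slides focus on cheaper move-cost updates.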
- 7. ACO+TS on a Single GPU
  - We combined ACO and Taboo Search (TS)
  - Flowchart: start → initialize pheromone density τ → construct solutions based on τ → apply local search (Tabu search) → update pheromone density τ → terminate? → end. The pheromone density matrix τ is shared by these steps.
  - Time distribution in a sequential run on the CPU:

  | Instance | Construction of solutions | TS | Updating trail |
  |---|---|---|---|
  | tai40a | 0.007% | 99.992% | 0.001% |
  | tai50a | 0.005% | 99.994% | 0.000% |
  | tai60a | 0.004% | 99.996% | 0.000% |
  | tai80a | 0.002% | 99.997% | 0.000% |
  | tai100a | 0.002% | 99.998% | 0.000% |
  | tai50b | 0.022% | 99.976% | 0.002% |
  | tai60b | 0.017% | 99.982% | 0.001% |
  | tai80b | 0.011% | 99.988% | 0.001% |
  | tai100b | 0.008% | 99.991% | 0.000% |
  | tai150b | 0.005% | 99.995% | 0.000% |
- 8. Neighborhood in the QAP
  - A neighbor φ’ of φ in the QAP is obtained by swapping two elements, e.g. φ = (2, 1, 0, 3) → φ’ = (0, 1, 2, 3)
  - The size of the neighborhood N(φ) is Nsize = n(n-1)/2
  - To choose the best φ’, we need to calculate the costs of all Nsize neighbors
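A minimal sketch of this swap neighborhood, assuming 0-based positions; the generator name `neighbors` is hypothetical. It enumerates every pair of positions once, so it yields n(n-1)/2 neighbors.

```python
from itertools import combinations


def neighbors(phi):
    """Yield ((r, s), neighbor) for every swap of two positions r < s."""
    for r, s in combinations(range(len(phi)), 2):
        nb = list(phi)
        nb[r], nb[s] = nb[s], nb[r]  # exchange the r-th and s-th elements
        yield (r, s), nb


phi = [2, 1, 0, 3]
all_neighbors = list(neighbors(phi))
print(len(all_neighbors))  # n(n-1)/2 neighbors for n = 4
```

Swapping positions 0 and 2 of (2, 1, 0, 3) gives (0, 1, 2, 3), the example shown on the slide.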
- 9. Computation Cost of a Neighboring Solution
  - Let φ’ be a neighbor of φ obtained by exchanging the r-th and s-th elements of φ. The move cost Δ(φ, r, s) = f(φ’) - f(φ) can then be obtained in O(n) as

    $$
    \begin{aligned}
    \Delta(\phi, r, s) ={}& a_{rr}\,(b_{\phi(s)\phi(s)} - b_{\phi(r)\phi(r)}) + a_{rs}\,(b_{\phi(s)\phi(r)} - b_{\phi(r)\phi(s)}) \\
    &+ a_{sr}\,(b_{\phi(r)\phi(s)} - b_{\phi(s)\phi(r)}) + a_{ss}\,(b_{\phi(r)\phi(r)} - b_{\phi(s)\phi(s)}) \\
    &+ \sum_{k=0,\; k \neq r,s}^{n-1} \Big( a_{kr}\,(b_{\phi(k)\phi(s)} - b_{\phi(k)\phi(r)}) + a_{ks}\,(b_{\phi(k)\phi(r)} - b_{\phi(k)\phi(s)}) \\
    &\qquad\qquad + a_{rk}\,(b_{\phi(s)\phi(k)} - b_{\phi(r)\phi(k)}) + a_{sk}\,(b_{\phi(r)\phi(k)} - b_{\phi(s)\phi(k)}) \Big)
    \end{aligned}
    $$

  - Fast update [Taillard 04]: if Δ(φ, r, s) is stored for all pairs r, s, and {u, v} ∩ {r, s} = ∅ holds, then Δ(φ’, u, v) can be obtained in O(1) as

    $$
    \begin{aligned}
    \Delta(\phi', u, v) ={}& \Delta(\phi, u, v) \\
    &+ (a_{ru} - a_{rv} + a_{sv} - a_{su})\,(b_{\phi'(s)\phi'(u)} - b_{\phi'(s)\phi'(v)} + b_{\phi'(r)\phi'(v)} - b_{\phi'(r)\phi'(u)}) \\
    &+ (a_{ur} - a_{vr} + a_{vs} - a_{us})\,(b_{\phi'(u)\phi'(s)} - b_{\phi'(v)\phi'(s)} + b_{\phi'(v)\phi'(r)} - b_{\phi'(u)\phi'(r)})
    \end{aligned}
    $$
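The two move-cost computations can be written out as a plain Python sketch. The function names (`qap_cost`, `delta_full`, `delta_fast`) and the list-of-lists, 0-based representation are assumptions for illustration; this is sequential reference code, not the authors' GPU kernel.

```python
def qap_cost(a, b, phi):
    """Full O(n^2) cost, used here only as a reference."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]]
               for i in range(n) for j in range(n))


def delta_full(a, b, phi, r, s):
    """O(n) cost change caused by exchanging positions r and s of phi."""
    pr, ps = phi[r], phi[s]
    d = (a[r][r] * (b[ps][ps] - b[pr][pr]) + a[r][s] * (b[ps][pr] - b[pr][ps])
         + a[s][r] * (b[pr][ps] - b[ps][pr]) + a[s][s] * (b[pr][pr] - b[ps][ps]))
    for k in range(len(phi)):
        if k == r or k == s:
            continue
        pk = phi[k]
        d += (a[k][r] * (b[pk][ps] - b[pk][pr]) + a[k][s] * (b[pk][pr] - b[pk][ps])
              + a[r][k] * (b[ps][pk] - b[pr][pk]) + a[s][k] * (b[pr][pk] - b[ps][pk]))
    return d


def delta_fast(a, b, phi_new, r, s, u, v, old_delta_uv):
    """O(1) update of the (u, v) move cost after the (r, s) swap was applied.

    phi_new is the permutation after swapping r and s; old_delta_uv is the
    stored move cost of (u, v) before that swap. Requires {u, v} and {r, s}
    to be disjoint.
    """
    pr, ps, pu, pv = phi_new[r], phi_new[s], phi_new[u], phi_new[v]
    return (old_delta_uv
            + (a[r][u] - a[r][v] + a[s][v] - a[s][u])
            * (b[ps][pu] - b[ps][pv] + b[pr][pv] - b[pr][pu])
            + (a[u][r] - a[v][r] + a[v][s] - a[u][s])
            * (b[pu][ps] - b[pv][ps] + b[pv][pr] - b[pu][pr]))
```

After a swap (r, s) is accepted, only the pairs that share an index with r or s need the O(n) formula; every other stored move cost can be refreshed with the O(1) update.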
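- 10. Parallel computation of move cost: the simplest thread assignment
  - Assign the m agents to blocks: blockIdx.x = 0, 1, …, m-1
  - Assign the move calculations to threads: in each block, threadIdx.x = 0, 1, 2, …, Nsize-1, where Nsize = n(n-1)/2
  - Thread index for each pair of positions (the axes are labelled u and v on the slide), shown for n = 14:

  |    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
  |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  | 0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
  | 1 | 0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
  | 2 | 1 | 2 |  |  |  |  |  |  |  |  |  |  |  |  |
  | 3 | 3 | 4 | 5 |  |  |  |  |  |  |  |  |  |  |  |
  | 4 | 6 | 7 | 8 | 9 |  |  |  |  |  |  |  |  |  |  |
  | 5 | 10 | 11 | 12 | 13 | 14 |  |  |  |  |  |  |  |  |  |
  | 6 | 15 | 16 | 17 | 18 | 19 | 20 |  |  |  |  |  |  |  |  |
  | 7 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |  |  |  |  |  |  |  |
  | 8 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |  |  |  |  |  |  |
  | 9 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 |  |  |  |  |  |
  | 10 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |  |  |  |  |
  | 11 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 |  |  |  |
  | 12 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 |  |  |
  | 13 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |  |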
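The numbering in the index table follows the usual lower-triangular layout, so a thread can recover its pair from its index with a little arithmetic. The sketch below assumes that the row is the larger index and the column the smaller one; the function names are hypothetical, and on a GPU the same arithmetic would be applied to `threadIdx.x`.

```python
import math


def pair_to_index(col, row):
    """Linear index of the cell (col, row) with col < row."""
    return row * (row - 1) // 2 + col


def index_to_pair(k):
    """Inverse mapping: recover (col, row) from the linear index k."""
    # row is the largest integer with row * (row - 1) / 2 <= k
    row = (1 + math.isqrt(8 * k + 1)) // 2
    col = k - row * (row - 1) // 2
    return col, row


n = 14
n_size = n * (n - 1) // 2
pairs = [index_to_pair(k) for k in range(n_size)]
print(pairs[:6])
```

Using `math.isqrt` keeps the inverse mapping in exact integer arithmetic, which avoids rounding problems for large indices.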
- 11. Move-Cost Adjusted Thread Assignment (MATA)
  - Figure: computational time versus thread index for warp 0 (threads 0 to 31) and warp 1 (threads 32 to 63), distinguishing O(1) and O(n) move-cost calculations. Panel captions: “Delay Caused by Branch Divergence” and “No branch divergence in each warp!”
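One way to read the idea behind the figure is that moves needing the O(n) formula and moves needing the O(1) update should not share a warp. The sketch below is an assumption about how such a grouping could look, not the authors' actual assignment: the function name `grouped_assignment`, the padding with idle slots, and the choice to put the O(n) moves first are all hypothetical.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs


def grouped_assignment(n, r, s):
    """Order the n(n-1)/2 moves so that no warp mixes O(n) and O(1) work.

    (r, s) is the swap applied in the previous step. A move (u, v) that
    shares an index with (r, s) needs the O(n) formula; every other move
    can use the O(1) update.
    """
    slow, fast = [], []
    for v in range(1, n):
        for u in range(v):
            if {u, v}.isdisjoint({r, s}):
                fast.append((u, v))
            else:
                slow.append((u, v))

    def pad(moves):
        # Fill the last warp of a group with idle slots (None).
        return moves + [None] * (-len(moves) % WARP_SIZE)

    return pad(slow) + pad(fast), len(slow), len(fast)


slots, n_slow, n_fast = grouped_assignment(14, 0, 1)
print(n_slow, n_fast, len(slots))
```

For n = 14 there are 2n - 3 = 25 moves of the O(n) kind and 66 of the O(1) kind, so this layout uses one warp for the former and three for the latter, at the price of a few idle threads.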
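- 12. Speedup on a Single GTX480
  - CPU used for comparison: i7 965, 3.2 GHz
  - Bar chart of speedup per QAP instance. The bar labels are not preserved in the transcript; the series 2 average (25.5) is the MATA speedup quoted in the conclusion.

  | QAP instance | Speedup (series 1) | Speedup (series 2) |
  |---|---|---|
  | tai50a | 3.7 | 26.1 |
  | tai60a | 3.4 | 27.7 |
  | tai80a | 4.3 | 20.3 |
  | tai100a | 3.4 | 18.3 |
  | tai50b | 3.9 | 24.9 |
  | tai60b | 4.6 | 35.5 |
  | tai80b | 4.2 | 21.4 |
  | tai100b | 5.4 | 29.5 |
  | Average | 4.1 | 25.5 |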
- 13. Implementation on Multiple GPUs. Figure: a CPU whose work memory holds the solutions, four ACO instances (ACO0 to ACO3), and four GPUs (GPU0 to GPU3).
- 14. 4 Types of Island Models
  - We implemented the following 4 types of island models:
    1. IM-INDP: Island model with independent runs
    2. IM-ELIT: Island model with elitist
    3. IM-RING: Island model with ring connected
    4. IM-ELMR: Island model with elitist and massive ring connected
- 15. IM-INDP: Island model with independent runs. Figure: the CPU and ACO0 to ACO3.
- 16. IM-ELIT: Island model with elitist. Figure: ACO0 to ACO3, with the labels “worst guy”, “best guy”, and “global best guy”.
- 17. IM-RING: Island model with ring connected. Figure: ACO0 to ACO3 connected in a ring, with the labels “worst guy” and “best guy”.
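The slides give only figure labels for the migration schemes, so the following Python sketch rests on assumptions: in the ring model each island's best solution replaces the worst solution of the next island, and in the elitist model the global best replaces the worst solution of every island. The function names and the list-based islands are hypothetical, and lower cost is taken to be better.

```python
def migrate_ring(islands, cost):
    """Ring migration: the best of island i-1 replaces the worst of island i."""
    bests = [min(isl, key=cost) for isl in islands]  # snapshot before replacing
    for i, isl in enumerate(islands):
        worst = max(range(len(isl)), key=lambda j: cost(isl[j]))
        isl[worst] = bests[i - 1]  # i - 1 wraps around to the last island


def migrate_elite(islands, cost):
    """Elitist migration: the global best replaces the worst of each island."""
    global_best = min((ind for isl in islands for ind in isl), key=cost)
    for isl in islands:
        worst = max(range(len(isl)), key=lambda j: cost(isl[j]))
        isl[worst] = global_best


# Toy example: individuals are plain numbers and the cost is the number itself.
ring = [[5, 9], [1, 7], [3, 8]]
migrate_ring(ring, cost=lambda x: x)
print(ring)
```

Taking the snapshot of the best solutions before any replacement keeps the ring exchange independent of the order in which islands are processed.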
- 18. IM-ELMR: Island model with elitist and massive ring connected. Figure: the CPU, the label “IM-ELIT +”, and ACO0 to ACO3 connected in a ring.
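- 19. Results of Island Models with 4 GPUs
  - Bar chart of speedup per QAP instance, four series per instance. The legend is not preserved in the transcript; the series 4 average (3.3) is the IM-ELMR speedup quoted in the conclusion.

  | QAP instance | Series 1 | Series 2 | Series 3 | Series 4 |
  |---|---|---|---|---|
  | tai50a | 2.1 | 2.6 | 2.9 | 3.3 |
  | tai60a | 1.9 | 2.2 | 2.3 | 2.5 |
  | tai80a | 2.4 | 2.5 | 2.7 | 2.9 |
  | tai100a | 1.7 | 2.1 | 2.2 | 2.5 |
  | tai50b | 1.5 | 2.3 | 2.5 | 3 |
  | tai60b | 1.2 | 1.4 | 1.9 | 2.6 |
  | tai80b | 1.5 | 4.7 | 4.3 | 6.5 |
  | tai100b | 1.4 | 2.3 | 2.3 | 3.2 |
  | Average | 1.7 | 2.5 | 2.6 | 3.3 |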
- 20. Conclusion
  - On a single GPU with “MATA”: 25.5 times speedup over the CPU (i7 965, 3.2 GHz)
  - On 4 GPUs (GTX480): the IM-ELMR model has 3.3 times speedup over a single GPU
  - As a result, 25.5 × 3.3 ≈ 84.2 times speedup compared with the CPU computation
- 21. GPOCL: A Massively Parallel GP Implementation in OpenCL. Douglas A. Augusto (douglas@lncc.br) and Helio J.C. Barbosa (hcbm@lncc.br), Laboratório Nacional de Computação Científica (LNCC), Rio de Janeiro, Brazil
- 22. GPOCL’s Features
  - Fast and efficient C/C++ implementation based on a compact linear tree representation.
  - Massively parallel tree interpretation using OpenCL.
  - It can be executed on virtually any parallel device, comprising different architectures and vendors.
  - It implements three different parallel strategies (fitness-based, population-based, and a mixture of both).
  - To improve diversity it can evolve loosely-coupled subpopulations (neighborhoods).
  - It has a rich set of command-line options, including the primitive set definition, probabilities of the genetic operators, stopping criteria, minimum and maximum tree sizes, and the configuration of neighborhoods.
  - It is Free Software (http://gpocl.sf.net).
- 23. Open Computing Language (OpenCL)
  - Open Computing Language, or simply OpenCL, is an open-standard programming language for heterogeneous parallel computing (http://www.khronos.org).
  - It aims at efficiently exploiting the computing power of all processing devices, such as traditional processors (CPU) and accelerators (GPU, FPGA, DSP, Intel’s MIC, and so forth).
  - It provides a uniform programming interface, which saves the programmer from writing different codes in different languages when targeting multiple compute architectures, thus providing portability.
  - It is very flexible (low-level language).
- 24. GPOCL
  - GPOCL implements a GP system using a prefix linear tree representation. Its main routine performs the following high-level procedures:
    1. OpenCL initialization: This is the step where the general OpenCL-related tasks are initialized.
    2. Calculating n-dimensional ranges: Defines how much parallel work there will be and how it is distributed among the compute units.
    3. Memory buffers creation: In this phase all global memory regions accessed by the OpenCL kernels are allocated on the device and possibly initialized. The fitness cases are transferred and enough space is reserved for the population and error vectors.
    4. Kernel building: An OpenCL kernel, relative to a given strategy of parallelization, is compiled just-in-time, targeting the compute device.
    5. Evolving: This iterative routine implements the actual genetic programming dynamics.
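To make the prefix linear tree representation concrete, here is a minimal Python sketch of an interpreter for it. The token set (`+`, `-`, `*`, `neg`, the variable `x`, numeric constants) and the function name `interpret` are assumptions chosen for illustration and are not GPOCL's actual primitives or code.

```python
import operator

# Hypothetical primitive set: name -> (arity, function).
PRIMITIVES = {
    '+': (2, operator.add),
    '-': (2, operator.sub),
    '*': (2, operator.mul),
    'neg': (1, operator.neg),
}


def interpret(program, x):
    """Evaluate a tree stored as a flat prefix list, e.g. ['+', '*', 'x', 'x', 1].

    Scanning the list from right to left lets a single stack replace
    recursion: operands are pushed first, and each function pops as many
    values as its arity.
    """
    stack = []
    for token in reversed(program):
        if isinstance(token, str) and token in PRIMITIVES:
            arity, fn = PRIMITIVES[token]
            args = [stack.pop() for _ in range(arity)]  # leftmost operand first
            stack.append(fn(*args))
        elif token == 'x':
            stack.append(x)
        else:
            stack.append(token)  # numeric constant
    return stack.pop()


# x*x + 1 evaluated at x = 3
print(interpret(['+', '*', 'x', 'x', 1], 3))
```

A flat array with a stack-based loop avoids pointers and recursion, which is convenient for device code, where a kernel would run a loop of this kind for each fitness case.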
- 25. Main Evolutionary Algorithm

      Create (randomly) the initial population P;
      Evaluate(P);
      for generation ← 1 to NG do
          Copy the best (elitism) programs of P to the temporary population Ptmp;
          while |Ptmp| < |P| do
              Select and copy from P two fit programs, p1 and p2;
              if [probabilistically] crossover then
                  Recombine p1 and p2, generating p1′ and p2′;
                  p1 ← p1′; p2 ← p2′;
              end
              if [probabilistically] mutation then
                  Apply mutation in p1 and p2, creating p1′ and p2′;
                  p1 ← p1′; p2 ← p2′;
              end
              Insert p1 and p2 into Ptmp;
          end
          P ← Ptmp; then reset Ptmp;
          Evaluate(P);
      end
      return the best program found;
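The generational loop above can be sketched as runnable Python. Several details are assumptions made here rather than taken from the slides: binary tournament selection, the default probabilities, a single elite, and the fact that the error is computed on demand through the `error` callback instead of a separate batched Evaluate(P) step. The operators are passed in as functions, so the sketch is independent of the program representation.

```python
import random


def evolve(init, error, crossover, mutate, pop_size, generations,
           p_cx=0.9, p_mut=0.1, elite=1, rng=None):
    """Generational loop with elitism; lower error means fitter."""
    rng = rng or random.Random(0)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Copy the best programs (elitism) into the temporary population.
        tmp = sorted(pop, key=error)[:elite]
        while len(tmp) < pop_size:
            # Binary tournaments stand in for "select two fit programs".
            p1 = min(rng.sample(pop, 2), key=error)
            p2 = min(rng.sample(pop, 2), key=error)
            if rng.random() < p_cx:
                p1, p2 = crossover(p1, p2, rng)
            if rng.random() < p_mut:
                p1, p2 = mutate(p1, rng), mutate(p2, rng)
            tmp.extend([p1, p2])
        pop = tmp[:pop_size]  # P <- Ptmp
    return min(pop, key=error)


# Toy usage on bit strings: the error is the number of zeros.
def init(rng):
    return [rng.randint(0, 1) for _ in range(8)]

def zeros(ind):
    return ind.count(0)

def one_point(a, b, rng):
    c = rng.randrange(1, 8)
    return a[:c] + b[c:], b[:c] + a[c:]

def flip(a, rng):
    i = rng.randrange(8)
    return a[:i] + [1 - a[i]] + a[i + 1:]

best = evolve(init, zeros, one_point, flip, pop_size=10, generations=20,
              rng=random.Random(1))
print(best, zeros(best))
```

Because the elite is copied first and never truncated, the best error in the population cannot get worse from one generation to the next.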
- 26. Evaluate(P)
  - The evaluation step itself does not do much; the hard work is done mostly by the OpenCL kernels. Basically, three things happen within Evaluate(P):
    1. Population transfer: All programs of P are transferred to the target compute device.
    2. Kernel execution: For any non-trivial problem, this is the most demanding phase. Here, the entire recently transferred population is evaluated on the compute device by interpreting each program over each fitness case. Fortunately, this step can be done in parallel and accelerated by GPUs.
    3. Error retrieval: After being computed and accumulated in the previous step, the population’s prediction errors need to be transferred to the host so that this information is available to the evolutionary process.
- 27. Overall Best Parallelization Strategy
  - Both the population of programs and the fitness cases are parallelized.
  - It is a mixture of the fitness- and population-based strategies.
  - While different programs are evaluated simultaneously on different compute units (CU), the processing elements (PE) within each CU take care, in parallel, of the whole training data set.
  - Since the PEs within each CU interpret the same program, instruction divergence is unlikely.
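The work decomposition of the combined strategy can be simulated sequentially in Python. This is only a sketch of the mapping, with assumptions stated plainly: programs are represented as Python callables instead of prefix arrays, the error is the sum of absolute differences, each PE takes a strided share of the fitness cases, and the names `evaluate_population` and `n_pe` are hypothetical. On a real device the two outer loops would run in parallel.

```python
def evaluate_population(programs, cases, n_pe=4):
    """Return one accumulated error per program.

    Outer loop  : one program per compute unit (work-group).
    Middle loop : the processing elements of that compute unit.
    Inner loop  : the strided subset of fitness cases handled by one PE.
    """
    errors = []
    for program in programs:
        partial = [0.0] * n_pe  # one partial sum per processing element
        for pe in range(n_pe):
            for i in range(pe, len(cases), n_pe):
                x, target = cases[i]
                partial[pe] += abs(program(x) - target)
        errors.append(sum(partial))  # reduction inside the compute unit
    return errors


cases = [(x, x * x) for x in range(6)]  # fitness cases for y = x^2
programs = [lambda x: x * x, lambda x: x + 1]
print(evaluate_population(programs, cases))
```

All PEs of one compute unit execute the same program on different fitness cases, which is why this layout keeps instruction divergence low.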
- 28. Some benchmarks on an NVIDIA GTX-285 GPU, an old generation GPU (released in early 2009)
- 29. Fitness-based Parallelization Strategy. Figure: billion GPop/s as a function of population size and data set size. 9.540 billion GPop/s (good performance, but requires a lot of fitness cases)
- 30. Population-based Parallelization Strategy. Figure: billion GPop/s as a function of population size and data set size. 0.690 billion GPop/s (bad performance, causes a lot of instruction divergence)
- 31. Combined Fitness- and Population-based Parallelization Strategy. Figure: billion GPop/s as a function of population size and data set size. 11.85 billion GPop/s
- 32. Thank you!
- 33. And the winner is.... “Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment”, Shigeyoshi Tsutsui, Hannan University, and Noriyuki Fujimoto, Osaka Prefecture University
