CODE GPU WITH CUDA
DEVICE CODE OPTIMIZATION PRINCIPLE
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
Optimization principle
Performance limiters
Little’s law
TLP & ILP
DEVICE CODE OPTIMIZATION PRINCIPLE
The specifics of the SIMT architecture make the GPU a high-latency machine by design, so
Hiding latency is the only GPU-specific optimization principle
Typical latencies for Kepler generation
register writeback: ~10 cycles
L1: ~34 cycles
Texture L1: ~96 cycles
L2: ~160 cycles
Global memory: ~350 cycles
PERFORMANCE LIMITERS
Optimize for GPU ≃ Optimize for latency
Factors that prevent latency hiding:
Insufficient parallelism
Inefficient memory accesses
Inefficient control flow
THROUGHPUT & LATENCY
Throughput
is how many operations are performed in one cycle
Latency
is how many cycles the pipeline stalls before a dependent operation can issue
Inventory
is the number of warps in flight, i.e. in the execution stage of the pipeline
LITTLE’S LAW
L = λ × W
Inventory (L) = Throughput (λ) × Latency (W)
Example: a GPU with 8 operations per clock and 18-clock latency
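The example above can be checked directly with Little's law (a minimal arithmetic sketch; the figures are the ones quoted on the slide):

```python
# Little's law: inventory (operations in flight) needed to saturate the pipeline
throughput = 8    # operations issued per clock
latency = 18      # clocks before a dependent operation can issue

inventory = throughput * latency
print(inventory)  # 144 independent operations must be in flight to hide latency
```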
LITTLE’S LAW: FFMA EXAMPLE
Fermi GF100
Throughput: 32 operations per clock (1 warp)
Latency: ~18 clocks
Maximum resident warps per SM: 24
Inventory: 1 × 18 = 18 warps in flight
Kepler GK110
Throughput: 128 operations per clock without ILP (4 warps)
Latency: ~10 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 10 = 40 warps in flight
Maxwell GM204
Throughput: 128 operations per clock (4 warps)
Latency: ~6 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 6 = 24 warps in flight
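The three FFMA cases above reduce to the same Little's-law product; a quick sketch, using only the per-architecture figures quoted on this slide:

```python
# (architecture, warps issued per clock, FFMA latency in clocks, max resident warps per SM)
archs = [
    ("Fermi GF100",   1, 18, 24),
    ("Kepler GK110",  4, 10, 64),
    ("Maxwell GM204", 4,  6, 64),
]

for name, warps_per_clock, latency, max_resident in archs:
    needed = warps_per_clock * latency  # warps in flight required to hide FFMA latency
    print(f"{name}: {needed} warps in flight "
          f"({needed / max_resident:.0%} of the {max_resident}-warp limit)")
```

Note that the required inventory is well below the resident-warp limit in every case, which is what makes FFMA latency hideable at reasonable occupancy.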
TLP & ILP
Thread Level Parallelism
enabling factors:
a sufficient number of warps in flight per SM
limiting factors:
bad launch configuration
resource-consuming kernels
poorly parallelized code
Instruction Level Parallelism
enabling factors:
independent instructions per warp
dual issue capabilities
limiting factors:
structural hazards
data hazards
IMPROVING TLP
Occupancy
is the actual number of warps running concurrently on a multiprocessor divided by
the maximum number of warps the hardware can run concurrently
Improve occupancy to achieve better TLP
Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell)
blocks, BUT you need resources for them: registers, shared memory
Kepler has 64 K 32-bit registers per SM and a 32-lane-wide warp
65536 registers / 64 warps / 32 warp_size = 32 registers/thread
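The register-budget division above can be sketched as a one-line check:

```python
# Register budget per thread at full occupancy on Kepler (figures from the slide)
registers_per_sm = 65536  # 64 K 32-bit registers per SM
max_warps = 64            # maximum resident warps per SM
warp_size = 32            # threads per warp

regs_per_thread = registers_per_sm // (max_warps * warp_size)
print(regs_per_thread)  # 32 registers/thread at 100% occupancy
```

A kernel that needs more than this budget per thread forces occupancy below the hardware maximum.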
IMPROVING ILP
Kernel unrolling: process more elements per thread, since operations on different
elements are independent
The device-code compiler is reasonably good at instruction reordering
Unroll loops in device code to increase the number of independent operations
Other techniques used to increase ILP on CPUs apply as well
__global__ void unrolled(const float* in, float* out)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int totalThreads = blockDim.x * gridDim.x;
    out[tid] = process(in[tid]);
    out[tid + totalThreads] = process(in[tid + totalThreads]);
}

#pragma unroll CONST_EXPRESSION
for (int i = 0; i < N_ITERATIONS; i++) { /* ... */ }
ILP ON MODERN GPUS
ILP is a must-have on older architectures, and it still helps to hide pipeline latencies
on modern GPUs
Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps
each clock. Compute-core utilization: 1.0
Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps
each clock. Without ILP only 128 of 192 cores are used. Compute-core utilization:
≈0.67
Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps
every 2 clocks. Without ILP only 32 of 48 cores are used. Compute-core utilization:
≈0.67
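The utilization figures above follow from dividing the cores a no-ILP issue pattern can feed by the cores available; a sketch using only the per-architecture numbers quoted on this slide:

```python
# Compute-core utilization without ILP
# (architecture, cores fed per clock without ILP, total cores per SM)
configs = [
    ("Maxwell GM204", 128, 128),  # 4 warps * 32 lanes fill all 128 cores
    ("Kepler GK110",  128, 192),  # 4 warps * 32 lanes cover only 128 of 192 cores
    ("Fermi sm_21",    32,  48),  # without dual issue, 32 of 48 cores per clock
]

for name, cores_used, cores_total in configs:
    print(f"{name}: {cores_used}/{cores_total} = {cores_used / cores_total:.2f}")
```

Kepler and sm_21 Fermi land at the same ≈0.67 because both provision more cores than the schedulers can feed with single-issue alone; ILP (dual issue) is what closes the gap.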
FINAL WORDS
GPU optimization principles:
Principle #1: hide latency
Principle #2: see principle #1
THE END
BY CUDA.GEEK / 2013–2015
