SlideShare a Scribd company logo
1 of 50
Download to read offline
The Best ProgrammingThe Best Programming
Practice for Cell/B.E.Practice for Cell/B.E.
20020099..1212..1111
Sony Computer Entertainment Inc.Sony Computer Entertainment Inc.
Akira TsukamotoAkira Tsukamoto
Who am I?Who am I?
Cell Broadband Engine
(Cell/B.E.)
1 PPE: general-purpose
PowerPC architecture
System management
8 SPEs: between general-purpose and special-
purpose
SIMD capable RISC architecture
128 bit (16 byte) register size
128 registers
Programs and data must be located on 256KiB local
storage
External data access by DMA via MFC
Workloads for game, media, etc
9 Cores (1 PPE + 8 SPE)
Peak Performance
Over 200 GFlops (Single Precision)
4 (32-bit SIMD) * 2 (FMA) Flop
* 8 SPE * 3.2GHz
= 204.8 GFlops per socket
Over 20 GFlops (Double Precision)
Up to 25.6 GB/s Memory B/W
35 GB/s (out) + 25 GB/s (in) I/O B/W
Over 200 GB/s Total Interconnect B/W
96B/cycle
Performance of Cell/B.E.
Processor
Overview of Cell/B.E.
PPE
L2
Cache
(512KB)
MICXDR
RAM
PPU
L1
I/O Controller
SPE
SPE
SPE
SPE
SPE
SPE
SPE
EIB(ElementInterconnectBus)
SPE
Dual XDR®
Memory
Channel
XDR
RAM
SPE: Simple 128-bit SIMD
Processor
• For Efficient Data Processing
• Controlled by PPE programs
• New Architecture – Binaries
Compiled for SPU run on it
SPE: Simple 128-bit SIMD
Processor
• For Efficient Data Processing
• Controlled by PPE programs
• New Architecture – Binaries
Compiled for SPU run on it
PPE: General-Purpose
64-bit PPC Processor
• For Operating System
and Program Control
• PowerPC Compliant –
Linux and PPC Linux
Apps run on it
PPE: General-Purpose
64-bit PPC Processor
• For Operating System
and Program Control
• PowerPC Compliant –
Linux and PPC Linux
Apps run on it
SPULSMFC
Local Store (LS):
256KB Local Memory
for Instruction and Data
(Not a Cache: no trans-
parent mechanism to
fetch data from main
memory)
Local Store (LS):
256KB Local Memory
for Instruction and Data
(Not a Cache: no trans-
parent mechanism to
fetch data from main
memory)
EIB: High-
Speed
Internal Bus
(<200GB/s)
EIB: High-
Speed
Internal Bus
(<200GB/s)
MFC: Interface
to Other
Components
MFC: Interface
to Other
Components
The Future of Cell/B.E.
PowerXCell32i (Quad Cell/B.E.) was canceled
BUT
Current Cell/B.E. production will continue
PlayStation®3 (PS3®), IBM QS22 (CellBlade)
SPE architecture will be incorporated to Power CPU
Cell/B.E. Programming skills are
beneficial to achieve good
performance on the future computer
architecture
One big misunderstanding about
Cell/B.E.
Is SPE one kind of Floating Point Unit (FPU) or
Digital Signaling Processor (DSP)?
NO
SPE is one kind of regular CPU core but equipped with
optimized Single Instruction Multiple Data (SIMD)
operations.
Could run program and process data in their memory
called Local Storage (LS) as normal CPU does.
SPE vs. GPGPU
SPE
Very good performance of general instructions
if(), switch(), for(), while() are fast in C/C++ language
Capable for different processing in parallel (Task parallel model)
2 SPEs for Physics engine, 2 SPEs for vision recognition, 2 SPEs for codec
GPGPU
Limited performance on general instructions
Not good for different processing in parallel (Task parallel model)
Suitable for processing large data with the same calculation (Data parallel
model)
SPE is better for general purpose processing to adopt wide range
of programming
Cell/B.E. Programming
Environment
PPE toolchain
One of PowerPC targets
gcc and binutils with Cell/B.E. specific enhancements
SPE toolchain
New target architecture
spu-gcc, binutils, newlib (libc), ...
libspe
SPE management library
Provides OS-independent API by wrapping the raw SPUFS
interfaces
MARS
Provides effective SPE runtime environment
Cell/B.E. Programming
Environment
GCC, BINUTILS, GDB,
basic SPU libs (libspe2)
PS3 platform
support
Hypervisor
PS3®
QS20, 21, 22
(Cell Blade)
platform support
QS20, 21, 22
Toshiba Ref. Set
platform
support
Toshiba Ref. Set
IBMTOSHIBA Sony/SCEI
SPU Runtime libs (MARS)
SPU middleware, SPU programs
- Various Codecs, Various Physics,
- Face detection,
- Various motion detections, ….
Programming Tools
Common Linux Kernel Infrastructure (SPUFS)
ReusableReusable
ZEGO
platform
support
ZEGO®
Hello World Programming on
Cell/B.E.
SPE programming
SPE program prints “Hello World!”
SPE program prints “Hello World!”
#include <stdio.h>
int main()
{
printf(“Hello, World!¥n”);
return 0;
}
$ spu-gcc hello.c -o hello.spe
$ ./hello.spe
Program execution flow on SPE
Optimizing SPE program
Regular programming on SPE do not achieve
Over 200 GFlops performance
Requires optimization on SPE programming
Optimization Technique on
Cell/B.E.
Use SIMD Instructions
vector type extension on spu-gcc
Two double-precision floating-point data__vector double
Four single-precision floating-point data__vector float
Two signed 64-bit data__vector signed long long
Two unsigned 64-bit data__vector unsigned long long
Four signed 32-bit data__vector signed int
Four unsigned 32-bit data__vector unsigned int
Eight signed 16-bit data__vector signed short
Eight unsigned 16-bit data__vector unsigned short
Sixteen signed 8-bit data__vector signed char
Sixteen unsigned 8-bit data__vector unsigned char
DataVector Type
vector type extension on spu-gcc
SIMD programming
float a[4], b[4], c[4];
for (i = 0; i < 4; i++) {
c[i] = a[i] * b[i];
}
__vector float va, vb, vc;
vc = spu_mul(va, vb);
Other SIMD Built-in Functions
Finds the bitwise logical sums (OR)
between vectors a and b.
spu_or(a,b)vec_or(a,b)
Finds the bitwise logical products (AND)
between vectors a and b.
spu_and(a,b)vec_and(a,b)Logical
Instruction
s
Calculates the square roots of the
reciprocals of the elements of vector a.
spu_rsqrte(a)vec_rsqrte(a)
Calculates the reciprocals of the elements
of vector a.
spu_re(a,b)vec_re(a,b)
Multiplies the elements of vector a by the
elements of vector b and adds the
elements of vector c.
spu_madd(a,b,c)vec_madd(a,b,c)
Performs subtractions between the
elements of vectors a and b.
spu _sub(a,b)vec_sub(a,b)
Adds the elements of vectors a and b.spu_add(a,b)vec_add(a,b)Arithmetic
Instruction
s
DescriptionSPU SIMDVMXApplicable
Instructions
Use Double Buffering DMA
between Main memory and LS
DMA between Main memory and
LS
Main
Storage
The EA space is
shared with SPE
The EA space is
shared with SPE
SPE
SPU LS
MFC
DMAC in the MFC is
responsible to transfer the
data
DMAC in the MFC is
responsible to transfer the
data
GET or PUT
Address (EA)
Size
Please
transfer
data!
Single buffering and double
buffering
SPU
DMA request
WaitingSPU
MFC
DMA request
ProcessingWaiting
DMA request
Waiting
DMA transfer
Processing
DMA transfer
Single buffering
処理
Waiting
MFC
DMA request
処理
DMA request
処理
DMA request
Processing Processing
Double buffering
Processing
DMA request
DMA transfer DMA transfer DMA transfer DMA transfer
Use Aligned Data
How to Align Data
128 byte aligned data is best for DMA
16 byte aligned data is best for SPE instructions
Use gcc’s aligned attribute for static or global data
Example: 16-bytes aligned integer variable
Example: defining a 128-bytes-aligned structure type
Use posix_memalign for dynamic allocation
__attribute__((aligned(align_size)))
int a __attribute__((aligned(16)));
typedef struct { int a; char b; }
__attribute__((aligned(128))) aligned_struct_t;
#define _XOPEN_SOURCE 600 /* include POSIX 6th definition */
#include <stdlib.h>
int posix_memalign(void **ptr, size_t 16, size_t size);
Use Loop Unrolling
Unroll for loop
SPE has 128 entries of registers
for (i = 0; i < N; i += 16) {
vec_float4 av0 = *(vec_float4*)(a + i);
vec_float4 bv0 = *(vec_float4*)(b + i);
vec_float4 av1 = *(vec_float4*)(a + i + 4);
...
vec_float4 cv0 = av0 * bv0;
vec_float4 cv1 = av1 * bv1;
...
*(vec_float4*)(c + i) = cv0;
*(vec_float4*)(c + i + 4) = cv1;
...
}
for (i = 0; i < N; i += 4) {
*(vec_float4*)&c[i] = *(vec_float4*)&a[i] *
*(vec_float4*)&b[i]);
}
Compute on registers
Load the input
to registers
Effective Programming model
of Cell/B.E.
Typical Cell/B.E. Program
1 PPE program
User interface
Data input/output
Loading and executing SPE programs
Multiple SPE programs
Image processing
Physics simulation
Scientific calculation
PPE Centric Programming Model
PPE is responsible for:
Loading/switching of SPE programs
Sending/receiving of necessary data to its SPE programs
Problems of PPE Centric
Programming
Difficult for the PPE to know SPE's status
Stalls, waits...
Inefficient scheduling of SPE programs
Extra load of the PPE
Communication
Scheduling
Preparation for MARSPreparation for MARS
MultiMulti--core Application Runtime Systemcore Application Runtime System
What is MARS?What is MARS?
MARS
MARS stands for Multi-core Application Runtime
System
Provides efficient runtime environment for SPE centric
application programs
SPE Centric Programming Model
The individual SPEs are responsible for:
Loading, executing and switching SPE programs
Sending/receiving data between SPEs
What MARS Provides
PPE Centric Programming model is slow
Use PPE as less as possible
MARS provides SPE centric runtime without
complicate programming:
Scheduling workloads by SPEs
Lightweight context switching
Synchronization objects cooperating with the scheduler
MARS Advantages
Simplifies maximizing SPE usage
Efficient context switching
Minimizes data exchanged with PPE
Minimizes runtime load of the PPE
MARS Task Sync Objects
Semaphores, event flags, queues...
Waiting condition results in a task switch
Avoiding wasting time just on waiting
Programming MARSProgramming MARS
Typical Usage Scenario
1. PPE creates MARS context
2. PPE creates task objects
3. PPE creates synchronization objects
4. PPE starts the initial tasks
5. The existing tasks start additional tasks
6. The tasks do application specific works
7. PPE waits for tasks
8. PPE destroys task objects and sync objects
9. PPE destroys MARS context
10 int main(void)
11 {
12 struct mars_context *mars_ctx;
13 struct mars_task_id task1_id;
14 static struct mars_task_id task2_id[NUM_SUB_TASKS] __attribute__((aligned(16)));
15 struct mars_task_args task_args;
16 int i;
17
18 mars_context_create(&mars_ctx, 0, 0);
19
20 mars_task_create(mars_ctx, &task1_id, "Task 1", spe_main_prog.elf_image,
21 MARS_TASK_CONTEXT_SAVE_ALL);
22
23 for (i = 0; i < NUM_SUB_TASKS; i++) {
24 char name[16];
25 sprintf(name, "Task 2.%d", i);
26 mars_task_create(mars_ctx, &task2_id[i], name, spe_calc_prog.elf_image,
27 MARS_TASK_CONTEXT_SAVE_ALL);
28 }
29
30 task_args.type.u64[0] = mars_ptr_to_ea(&task2_id[0]);
31 task_args.type.u64[1] = mars_ptr_to_ea(&task2_id[1]);
32
33 /* start main SPE MARS task */
34 mars_task_schedule(&task1_id, &task_args, 0);
35
36 mars_task_wait(&task1_id, NULL);
37 mars_task_destroy(&task1_id);
38
39 for (i = 0; i < NUM_SUB_TASKS; i++)
40 mars_task_destroy(&task2_id[i]);
41
42 mars_context_destroy(mars_ctx);
43
44 return 0;
45 }
Preparation program on PPE
1 #include <stdlib.h>
2 #include <spu_mfcio.h>
3 #include <mars/mars.h>
4
5 #define DMA_TAG 0
6
7 int mars_task_main(const struct mars_task_args *task_args)
8 {
9 static struct mars_task_id task2_0_id __attribute__((aligned(16)));
10 static struct mars_task_id task2_1_id __attribute__((aligned(16)));
11 struct mars_task_args args;
12
13 mfc_get(&task2_0_id, task_args->type.u64[0], sizeof(task2_0_id), DMA_TAG, 0, 0);
14 mfc_get(&task2_1_id, task_args->type.u64[1], sizeof(task2_1_id), DMA_TAG, 0, 0);
15 mfc_write_tag_mask(1 << DMA_TAG);
16 mfc_read_tag_status_all();
17
18 /* start calculation SPE MARS task 0 */
19 args.type.u32[0] = 123;
20 mars_task_schedule(&task2_0_id, &args, 0);
21
22 /* start calculation SPE MARS task 1 */
23 args.type.u32[0] = 321;
24 mars_task_schedule(&task2_1_id, &args, 0);
25
26 mars_task_wait(&task2_0_id, NULL);
27 mars_task_wait(&task2_1_id, NULL);
28
29 return 0;
30 }
Main MARS task program on
SPE
1 #include <stdio.h>
2 #include <mars/mars.h>
3
4 int mars_task_main(const struct mars_task_args *task_args)
5 {
6
7 /* do some calculations here */
8
9 printf("MPU(%d): %s - Hello! (%d)¥n",
10 mars_task_get_kernel_id(), mars_task_get_name(),
11 task_args->type.u32[0]);
12
13 return 0;
14 }
Program for processing on SPE
MARS synchronization API
Barrier
This is used to make multiple MARS tasks wait at a certain point in a
program and to resume the task execution when all tasks are ready.
Event Flag
This is used to send event notifications between MARS tasks or between
MARS tasks and host programs.
Queue
This is used to provide a FIFO queue mechanism for data transfer between
MARS tasks or between MARS tasks and host programs.
Semaphore
This is used to limit the number of concurrent accesses to shared resources
among MARS tasks.
Task Signal
This is used to signal a MARS task in the waiting state to change state so
that it can be scheduled to continue execution.
Benchmark on MARS
Sample Application: OpenSSL
OpenSSL Application
HDD
OpenSSL libcrypto
MARS
OpenSSL Engine
PPE part
OpenSSL Engine API
OpenSSL API
PPE
SPEs
MARS
OpenSSL
Engine
task
MARS Kernel
Input
MARS Queue
Output
MARS Queue
Benchmarking: OpenSSL
PPE vs SPE
0
100
200
300
400
500
600
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
seconds
# of simultaneous tasks (# of input streams)
PPE vs libspe2 (calculating SHA256 of a 4GB+ file)
PPE real
PPE user
PPE sys
libspe2 real
libspe2 user
libspe2 sys
Benchmarking: OpenSSL
libspe2 vs MARS
0
100
200
300
400
500
600
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
seconds
# of simultaneous tasks (# of input streams)
libspe2 vs MARS (calculating SHA256 of a 4GB+ file)
libspe2 real
libspe2 user
libspe2 sys
MARS real
MARS user
MARS sys
libspe2 real : +
libspe2 user: ■
libspe2 sys : ○
MARS real : +
MARS user : ■
MARS sys : ○
How to Approach toHow to Approach to
Cell/B.E. TechnicalCell/B.E. Technical
InformationInformation
Information on Cell/B.E.
programming
Cell/B.E. programming document
http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-
docs/ps3-linux-docs-08.06.09/CellProgrammingPrimer.html
PS3 Linux Public Information
http://www.playstation.com/ps3-openplatform/index.html
http://www.kernel.org/pub/linux/kernel/people/geoff/cell/
Cell/B.E. information by IBM
http://www.ibm.com/developerworks/power/cell/documents.html
http://www.bsc.es/projects/deepcomputing/linuxoncell/
Cell/B.E. Discussion Mailing List:
cbe-oss-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/cbe-oss-dev
Cell/B.E. Discussion IRC:
#cell at irc.freenode.org
MARS Releases, Source Code, Samples
http://ftp.uk.linux.org/pub/linux/Sony-PS3/mars/
MARS Development Repositories:
git://git.infradead.org/ps3/mars-src.git
http://git.infradead.org/ps3/mars-src.git

More Related Content

What's hot

Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 
High-Performance Physics Solver Design for Next Generation Consoles
High-Performance Physics Solver Design for Next Generation ConsolesHigh-Performance Physics Solver Design for Next Generation Consoles
High-Performance Physics Solver Design for Next Generation ConsolesSlide_N
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
Glossary of terms
Glossary of terms Glossary of terms
Glossary of terms crimzon36
 
09 accelerators
09 accelerators09 accelerators
09 acceleratorsMurali M
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldAllan Cantle
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureAllan Cantle
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentationVishal Singh
 
IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPUJino Antony
 
Cartilla en ingles
Cartilla en inglesCartilla en ingles
Cartilla en inglesanderson
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game EngineJohan Andersson
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA accelerationMarco77328
 
Ics21 workshop decoupling compute from memory, storage &amp; io with omi - ...
Ics21 workshop   decoupling compute from memory, storage &amp; io with omi - ...Ics21 workshop   decoupling compute from memory, storage &amp; io with omi - ...
Ics21 workshop decoupling compute from memory, storage &amp; io with omi - ...Vaibhav R
 
Graphics processing unit (GPU)
Graphics processing unit (GPU)Graphics processing unit (GPU)
Graphics processing unit (GPU)Amal R
 

What's hot (20)

Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
High-Performance Physics Solver Design for Next Generation Consoles
High-Performance Physics Solver Design for Next Generation ConsolesHigh-Performance Physics Solver Design for Next Generation Consoles
High-Performance Physics Solver Design for Next Generation Consoles
 
A
AA
A
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Glossary of terms
Glossary of terms Glossary of terms
Glossary of terms
 
09 accelerators
09 accelerators09 accelerators
09 accelerators
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentation
 
IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPU
 
Cartilla en ingles
Cartilla en inglesCartilla en ingles
Cartilla en ingles
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA acceleration
 
Graphics processing unit
Graphics processing unitGraphics processing unit
Graphics processing unit
 
Ics21 workshop decoupling compute from memory, storage &amp; io with omi - ...
Ics21 workshop   decoupling compute from memory, storage &amp; io with omi - ...Ics21 workshop   decoupling compute from memory, storage &amp; io with omi - ...
Ics21 workshop decoupling compute from memory, storage &amp; io with omi - ...
 
Graphics processing unit (GPU)
Graphics processing unit (GPU)Graphics processing unit (GPU)
Graphics processing unit (GPU)
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
Gpu
GpuGpu
Gpu
 

Viewers also liked

Viewers also liked (18)

Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processorHardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
 
PlayStation®3 Leads Stereoscopic 3D Entertainment World
PlayStation®3 Leads Stereoscopic 3D Entertainment World PlayStation®3 Leads Stereoscopic 3D Entertainment World
PlayStation®3 Leads Stereoscopic 3D Entertainment World
 
Optimization for Making Stereoscopic 3D Games on PlayStation® (PS3™)
Optimization for Making Stereoscopic 3D Games on PlayStation® (PS3™)Optimization for Making Stereoscopic 3D Games on PlayStation® (PS3™)
Optimization for Making Stereoscopic 3D Games on PlayStation® (PS3™)
 
Sony Computer Entertainment Europe Research & Development Division
Sony Computer Entertainment Europe Research & Development DivisionSony Computer Entertainment Europe Research & Development Division
Sony Computer Entertainment Europe Research & Development Division
 
José albín
José albínJosé albín
José albín
 
BADOHU SJ CV Jan-2017
BADOHU SJ CV Jan-2017BADOHU SJ CV Jan-2017
BADOHU SJ CV Jan-2017
 
Gloria
GloriaGloria
Gloria
 
Sandra vargas
Sandra vargasSandra vargas
Sandra vargas
 
Incremente su Productividad
Incremente su ProductividadIncremente su Productividad
Incremente su Productividad
 
Adela
AdelaAdela
Adela
 
Julián david
Julián davidJulián david
Julián david
 
German
GermanGerman
German
 
Ledy
LedyLedy
Ledy
 
CV-January 2017
CV-January 2017CV-January 2017
CV-January 2017
 
César
CésarCésar
César
 
Claudia
ClaudiaClaudia
Claudia
 
Mary luz
Mary luzMary luz
Mary luz
 
Jaime
JaimeJaime
Jaime
 

Similar to The Best Programming Practice for Cell/B.E.

The Basics of Cell Computing Technology
The Basics of Cell Computing TechnologyThe Basics of Cell Computing Technology
The Basics of Cell Computing TechnologySlide_N
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMchiportal
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceSpeck&Tech
 
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...mjaganm
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAlan Su
 
Introduction Cell Processor
Introduction Cell ProcessorIntroduction Cell Processor
Introduction Cell Processorcoolmirza143
 
Introduction Cell Processor
Introduction Cell ProcessorIntroduction Cell Processor
Introduction Cell Processorcoolmirza143
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessorsmumbahelp
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessorsmumbahelp
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessorsmumbahelp
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessorsmumbahelp
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Johan Andersson
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressHeiko Joerg Schick
 
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...ssuser65bfce
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureMichael Gschwind
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersPremier Farnell
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
Project Portfolio - Transferable Skills
Project Portfolio - Transferable SkillsProject Portfolio - Transferable Skills
Project Portfolio - Transferable Skillstuleyb
 
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationA 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationTalal Khaliq
 

Similar to The Best Programming Practice for Cell/B.E. (20)

The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
The Basics of Cell Computing Technology
The Basics of Cell Computing TechnologyThe Basics of Cell Computing Technology
The Basics of Cell Computing Technology
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
 
Introduction Cell Processor
Introduction Cell ProcessorIntroduction Cell Processor
Introduction Cell Processor
 
Introduction Cell Processor
Introduction Cell ProcessorIntroduction Cell Processor
Introduction Cell Processor
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessor
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessor
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessor
 
Mca4010 microprocessor
Mca4010  microprocessorMca4010  microprocessor
Mca4010 microprocessor
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit Microcontrollers
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
Project Portfolio - Transferable Skills
Project Portfolio - Transferable SkillsProject Portfolio - Transferable Skills
Project Portfolio - Transferable Skills
 
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationA 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
 

More from Slide_N

New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiSlide_N
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSlide_N
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60 Slide_N
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomSlide_N
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignSlide_N
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: TheorySlide_N
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMSlide_N
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Slide_N
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionSlide_N
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerSlide_N
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesSlide_N
 
MLAA on PS3
MLAA on PS3MLAA on PS3
MLAA on PS3Slide_N
 
SPU gameplay
SPU gameplaySPU gameplay
SPU gameplaySlide_N
 
Insomniac Physics
Insomniac PhysicsInsomniac Physics
Insomniac PhysicsSlide_N
 
SPU Shaders
SPU ShadersSPU Shaders
SPU ShadersSlide_N
 
SPU Physics
SPU PhysicsSPU Physics
SPU PhysicsSlide_N
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Slide_N
 
Practical SPU Programming in God of War III
Practical SPU Programming in God of War IIIPractical SPU Programming in God of War III
Practical SPU Programming in God of War IIISlide_N
 
The Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneThe Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneSlide_N
 

More from Slide_N (20)

New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - Kutaragi
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - Kutaragi
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living Room
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor Design
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTM
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution Continues
 
MLAA on PS3
MLAA on PS3MLAA on PS3
MLAA on PS3
 
SPU gameplay
SPU gameplaySPU gameplay
SPU gameplay
 
Insomniac Physics
Insomniac PhysicsInsomniac Physics
Insomniac Physics
 
SPU Shaders
SPU ShadersSPU Shaders
SPU Shaders
 
SPU Physics
SPU PhysicsSPU Physics
SPU Physics
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 
Practical SPU Programming in God of War III
Practical SPU Programming in God of War IIIPractical SPU Programming in God of War III
Practical SPU Programming in God of War III
 
The Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneThe Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s Fortune
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 

The Best Programming Practice for Cell/B.E.

  • 1. The Best ProgrammingThe Best Programming Practice for Cell/B.E.Practice for Cell/B.E. 20020099..1212..1111 Sony Computer Entertainment Inc.Sony Computer Entertainment Inc. Akira TsukamotoAkira Tsukamoto
  • 2. Who am I?Who am I?
  • 3. Cell Broadband Engine (Cell/B.E.) 1 PPE: general-purpose PowerPC architecture System management 8 SPEs: between general-purpose and special- purpose SIMD capable RISC architecture 128 bit (16 byte) register size 128 registers Programs and data must be located on 256KiB local storage External data access by DMA via MFC Workloads for game, media, etc
  • 4. 9 Cores (1 PPE + 8 SPE) Peak Performance Over 200 GFlops (Single Precision) 4 (32-bit SIMD) * 2 (FMA) Flop * 8 SPE * 3.2GHz = 204.8 GFlops per socket Over 20 GFlops (Double Precision) Up to 25.6 GB/s Memory B/W 35 GB/s (out) + 25 GB/s (in) I/O B/W Over 200 GB/s Total Interconnect B/W 96B/cycle Performance of Cell/B.E. Processor
  • 5. Overview of Cell/B.E. PPE L2 Cache (512KB) MICXDR RAM PPU L1 I/O Controller SPE SPE SPE SPE SPE SPE SPE EIB(ElementInterconnectBus) SPE Dual XDR® Memory Channel XDR RAM SPE: Simple 128-bit SIMD Processor • For Efficient Data Processing • Controlled by PPE programs • New Architecture – Binaries Compiled for SPU run on it SPE: Simple 128-bit SIMD Processor • For Efficient Data Processing • Controlled by PPE programs • New Architecture – Binaries Compiled for SPU run on it PPE: General-Purpose 64-bit PPC Processor • For Operating System and Program Control • PowerPC Compliant – Linux and PPC Linux Apps run on it PPE: General-Purpose 64-bit PPC Processor • For Operating System and Program Control • PowerPC Compliant – Linux and PPC Linux Apps run on it SPULSMFC Local Store (LS): 256KB Local Memory for Instruction and Data (Not a Cache: no trans- parent mechanism to fetch data from main memory) Local Store (LS): 256KB Local Memory for Instruction and Data (Not a Cache: no trans- parent mechanism to fetch data from main memory) EIB: High- Speed Internal Bus (<200GB/s) EIB: High- Speed Internal Bus (<200GB/s) MFC: Interface to Other Components MFC: Interface to Other Components
  • 6. The Future of Cell/B.E. PowerXCell32i (Quad Cell/B.E.) was canceled BUT Current Cell/B.E. production will continue PlayStation®3 (PS3®), IBM QS22 (CellBlade) SPE architecture will be incorporated to Power CPU Cell/B.E. Programming skills are beneficial to achieve good performance on the future computer architecture
  • 7. One big misunderstanding about Cell/B.E. Is SPE one kind of Floating Point Unit (FPU) or Digital Signaling Processor (DSP)? NO SPE is one kind of regular CPU core but equipped with optimized Single Instruction Multiple Data (SIMD) operations. Could run program and process data in their memory called Local Storage (LS) as normal CPU does.
  • 8. SPE vs. GPGPU SPE Very good performance of general instructions if(), switch(), for(), while() are fast in C/C++ language Capable for different processing in parallel (Task parallel model) 2 SPEs for Physics engine, 2 SPEs for vision recognition, 2 SPEs for codec GPGPU Limited performance on general instructions Not good for different processing in parallel (Task parallel model) Suitable for processing large data with the same calculation (Data parallel model) SPE is better for general purpose processing to adopt wide range of programming
  • 9. Cell/B.E. Programming Environment PPE toolchain One of PowerPC targets gcc and binutils with Cell/B.E. specific enhancements SPE toolchain New target architecture spu-gcc, binutils, newlib (libc), ... libspe SPE management library Provides OS-independent API by wrapping the raw SPUFS interfaces MARS Provides effective SPE runtime environment
  • 10. Cell/B.E. Programming Environment GCC, BINUTILS, GDB, basic SPU libs (libspe2) PS3 platform support Hypervisor PS3® QS20, 21, 22 (Cell Blade) platform support QS20, 21, 22 Toshiba Ref. Set platform support Toshiba Ref. Set IBMTOSHIBA Sony/SCEI SPU Runtime libs (MARS) SPU middleware, SPU programs - Various Codecs, Various Physics, - Face detection, - Various motion detections, …. Programming Tools Common Linux Kernel Infrastructure (SPUFS) ReusableReusable ZEGO platform support ZEGO®
  • 11. Hello World Programming on Cell/B.E.
  • 12. SPE programming SPE program prints “Hello World!” SPE program prints “Hello World!” #include <stdio.h> int main() { printf(“Hello, World!¥n”); return 0; } $ spu-gcc hello.c -o hello.spe $ ./hello.spe
  • 14. Optimizing SPE program Regular programming on SPE do not achieve Over 200 GFlops performance Requires optimization on SPE programming
  • 17. vector type extension on spu-gcc Two double-precision floating-point data__vector double Four single-precision floating-point data__vector float Two signed 64-bit data__vector signed long long Two unsigned 64-bit data__vector unsigned long long Four signed 32-bit data__vector signed int Four unsigned 32-bit data__vector unsigned int Eight signed 16-bit data__vector signed short Eight unsigned 16-bit data__vector unsigned short Sixteen signed 8-bit data__vector signed char Sixteen unsigned 8-bit data__vector unsigned char DataVector Type
  • 18. vector type extension on spu-gcc
  • 19. SIMD programming float a[4], b[4], c[4]; for (i = 0; i < 4; i++) { c[i] = a[i] * b[i]; } __vector float va, vb, vc; vc = spu_mul(va, vb);
  • 20. Other SIMD Built-in Functions Finds the bitwise logical sums (OR) between vectors a and b. spu_or(a,b)vec_or(a,b) Finds the bitwise logical products (AND) between vectors a and b. spu_and(a,b)vec_and(a,b)Logical Instruction s Calculates the square roots of the reciprocals of the elements of vector a. spu_rsqrte(a)vec_rsqrte(a) Calculates the reciprocals of the elements of vector a. spu_re(a,b)vec_re(a,b) Multiplies the elements of vector a by the elements of vector b and adds the elements of vector c. spu_madd(a,b,c)vec_madd(a,b,c) Performs subtractions between the elements of vectors a and b. spu _sub(a,b)vec_sub(a,b) Adds the elements of vectors a and b.spu_add(a,b)vec_add(a,b)Arithmetic Instruction s DescriptionSPU SIMDVMXApplicable Instructions
  • 21. Use Double Buffering DMA between Main memory and LS
  • 22. DMA between Main memory and LS Main Storage The EA space is shared with SPE The EA space is shared with SPE SPE SPU LS MFC DMAC in the MFC is responsible to transfer the data DMAC in the MFC is responsible to transfer the data GET or PUT Address (EA) Size Please transfer data!
  • 23. Single buffering and double buffering SPU DMA request WaitingSPU MFC DMA request ProcessingWaiting DMA request Waiting DMA transfer Processing DMA transfer Single buffering 処理 Waiting MFC DMA request 処理 DMA request 処理 DMA request Processing Processing Double buffering Processing DMA request DMA transfer DMA transfer DMA transfer DMA transfer
  • 25. How to Align Data 128 byte aligned data is best for DMA 16 byte aligned data is best for SPE instructions Use gcc’s aligned attribute for static or global data Example: 16-bytes aligned integer variable Example: defining a 128-bytes-aligned structure type Use posix_memalign for dynamic allocation __attribute__((aligned(align_size))) int a __attribute__((aligned(16))); typedef struct { int a; char b; } __attribute__((aligned(128))) aligned_struct_t; #define _XOPEN_SOURCE 600 /* include POSIX 6th definition */ #include <stdlib.h> int posix_memalign(void **ptr, size_t 16, size_t size);
  • 27. Unroll for loop SPE has 128 entries of registers for (i = 0; i < N; i += 16) { vec_float4 av0 = *(vec_float4*)(a + i); vec_float4 bv0 = *(vec_float4*)(b + i); vec_float4 av1 = *(vec_float4*)(a + i + 4); ... vec_float4 cv0 = av0 * bv0; vec_float4 cv1 = av1 * bv1; ... *(vec_float4*)(c + i) = cv0; *(vec_float4*)(c + i + 4) = cv1; ... } for (i = 0; i < N; i += 4) { *(vec_float4*)&c[i] = *(vec_float4*)&a[i] * *(vec_float4*)&b[i]); } Compute on registers Load the input to registers
  • 29. Typical Cell/B.E. Program 1 PPE program User interface Data input/output Loading and executing SPE programs Multiple SPE programs Image processing Physics simulation Scientific calculation
  • 30. PPE Centric Programming Model PPE is responsible for: Loading/switching of SPE programs Sending/receiving of necessary data to its SPE programs
  • 31. Problems of PPE Centric Programming Difficult for the PPE to know SPE's status Stalls, waits... Inefficient scheduling of SPE programs Extra load of the PPE Communication Scheduling
  • 32. Preparation for MARSPreparation for MARS MultiMulti--core Application Runtime Systemcore Application Runtime System
  • 33. What is MARS?What is MARS?
  • 34. MARS MARS stands for Multi-core Application Runtime System Provides efficient runtime environment for SPE centric application programs
  • 35. SPE Centric Programming Model The individual SPEs are responsible for: Loading, executing and switching SPE programs Sending/receiving data between SPEs
  • 36. What MARS Provides PPE Centric Programming model is slow Use PPE as less as possible MARS provides SPE centric runtime without complicate programming: Scheduling workloads by SPEs Lightweight context switching Synchronization objects cooperating with the scheduler
  • 37. MARS Advantages Simplifies maximizing SPE usage Efficient context switching Minimizes data exchanged with PPE Minimizes runtime load of the PPE
  • 38. MARS Task Sync Objects Semaphores, event flags, queues... Waiting condition results in a task switch Avoiding wasting time just on waiting
  • 40. Typical Usage Scenario 1. PPE creates MARS context 2. PPE creates task objects 3. PPE creates synchronization objects 4. PPE starts the initial tasks 5. The existing tasks start additional tasks 6. The tasks do application specific works 7. PPE waits for tasks 8. PPE destroys task objects and sync objects 9. PPE destroys MARS context
  • 41. 10 int main(void) 11 { 12 struct mars_context *mars_ctx; 13 struct mars_task_id task1_id; 14 static struct mars_task_id task2_id[NUM_SUB_TASKS] __attribute__((aligned(16))); 15 struct mars_task_args task_args; 16 int i; 17 18 mars_context_create(&mars_ctx, 0, 0); 19 20 mars_task_create(mars_ctx, &task1_id, "Task 1", spe_main_prog.elf_image, 21 MARS_TASK_CONTEXT_SAVE_ALL); 22 23 for (i = 0; i < NUM_SUB_TASKS; i++) { 24 char name[16]; 25 sprintf(name, "Task 2.%d", i); 26 mars_task_create(mars_ctx, &task2_id[i], name, spe_calc_prog.elf_image, 27 MARS_TASK_CONTEXT_SAVE_ALL); 28 } 29 30 task_args.type.u64[0] = mars_ptr_to_ea(&task2_id[0]); 31 task_args.type.u64[1] = mars_ptr_to_ea(&task2_id[1]); 32 33 /* start main SPE MARS task */ 34 mars_task_schedule(&task1_id, &task_args, 0); 35 36 mars_task_wait(&task1_id, NULL); 37 mars_task_destroy(&task1_id); 38 39 for (i = 0; i < NUM_SUB_TASKS; i++) 40 mars_task_destroy(&task2_id[i]); 41 42 mars_context_destroy(mars_ctx); 43 44 return 0; 45 } Preparation program on PPE
  • 42. 1 #include <stdlib.h> 2 #include <spu_mfcio.h> 3 #include <mars/mars.h> 4 5 #define DMA_TAG 0 6 7 int mars_task_main(const struct mars_task_args *task_args) 8 { 9 static struct mars_task_id task2_0_id __attribute__((aligned(16))); 10 static struct mars_task_id task2_1_id __attribute__((aligned(16))); 11 struct mars_task_args args; 12 13 mfc_get(&task2_0_id, task_args->type.u64[0], sizeof(task2_0_id), DMA_TAG, 0, 0); 14 mfc_get(&task2_1_id, task_args->type.u64[1], sizeof(task2_1_id), DMA_TAG, 0, 0); 15 mfc_write_tag_mask(1 << DMA_TAG); 16 mfc_read_tag_status_all(); 17 18 /* start calculation SPE MARS task 0 */ 19 args.type.u32[0] = 123; 20 mars_task_schedule(&task2_0_id, &args, 0); 21 22 /* start calculation SPE MARS task 1 */ 23 args.type.u32[0] = 321; 24 mars_task_schedule(&task2_1_id, &args, 0); 25 26 mars_task_wait(&task2_0_id, NULL); 27 mars_task_wait(&task2_1_id, NULL); 28 29 return 0; 30 } Main MARS task program on SPE
  • 43. 1 #include <stdio.h> 2 #include <mars/mars.h> 3 4 int mars_task_main(const struct mars_task_args *task_args) 5 { 6 7 /* do some calculations here */ 8 9 printf("MPU(%d): %s - Hello! (%d)¥n", 10 mars_task_get_kernel_id(), mars_task_get_name(), 11 task_args->type.u32[0]); 12 13 return 0; 14 } Program for processing on SPE
  • 44. MARS synchronization API Barrier This is used to make multiple MARS tasks wait at a certain point in a program and to resume the task execution when all tasks are ready. Event Flag This is used to send event notifications between MARS tasks or between MARS tasks and host programs. Queue This is used to provide a FIFO queue mechanism for data transfer between MARS tasks or between MARS tasks and host programs. Semaphore This is used to limit the number of concurrent accesses to shared resources among MARS tasks. Task Signal This is used to signal a MARS task in the waiting state to change state so that it can be scheduled to continue execution.
  • 46. Sample Application: OpenSSL OpenSSL Application HDD OpenSSL libcrypto MARS OpenSSL Engine PPE part OpenSSL Engine API OpenSSL API PPE SPEs MARS OpenSSL Engine task MARS Kernel Input MARS Queue Output MARS Queue
  • 47. Benchmarking: OpenSSL PPE vs SPE 0 100 200 300 400 500 600 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 seconds # of simultaneous tasks (# of input streams) PPE vs libspe2 (calculating SHA256 of a 4GB+ file) PPE real PPE user PPE sys libspe2 real libspe2 user libspe2 sys
  • 48. Benchmarking: OpenSSL libspe2 vs MARS 0 100 200 300 400 500 600 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 seconds # of simultaneous tasks (# of input streams) libspe2 vs MARS (calculating SHA256 of a 4GB+ file) libspe2 real libspe2 user libspe2 sys MARS real MARS user MARS sys libspe2 real : + libspe2 user: ■ libspe2 sys : ○ MARS real : + MARS user : ■ MARS sys : ○
  • 49. How to Approach toHow to Approach to Cell/B.E. TechnicalCell/B.E. Technical InformationInformation
  • 50. Information on Cell/B.E. programming Cell/B.E. programming document http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux- docs/ps3-linux-docs-08.06.09/CellProgrammingPrimer.html PS3 Linux Public Information http://www.playstation.com/ps3-openplatform/index.html http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ Cell/B.E. information by IBM http://www.ibm.com/developerworks/power/cell/documents.html http://www.bsc.es/projects/deepcomputing/linuxoncell/ Cell/B.E. Discussion Mailing List: cbe-oss-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/cbe-oss-dev Cell/B.E. Discussion IRC: #cell at irc.freenode.org MARS Releases, Source Code, Samples http://ftp.uk.linux.org/pub/linux/Sony-PS3/mars/ MARS Development Repositories: git://git.infradead.org/ps3/mars-src.git http://git.infradead.org/ps3/mars-src.git