2. OUTLINE
• Current Work
• Compute Integral Image – computeByRow
Using shared memory
Using register
Result
• CUDA Memory Architecture
3. USING SHARED MEMORY
• Scope: block
• Shared memory: stores the values of the previous line
• Computing by row for img[*][y] and img[*][y+1]:
• Time t: compute img[*][y] + shared_memory[*]
• Then store the result back into shared_memory[*]
• Time t+1: compute img[*][y+1] + shared_memory[*]
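The shared-memory scheme above can be sketched as a kernel in which each thread owns one column and the block's shared array carries the running sum of the previous row into the next iteration. This is a hypothetical illustration, not the original code: the kernel name, launch shape, and row-major indexing are assumptions.

```cuda
// Sketch of the shared-memory variant: one thread per column,
// prevRow[] holds the accumulated value of the row above (time t),
// which is read and then overwritten for time t+1.
__global__ void computeByColumnShared(float *img, int width, int height)
{
    extern __shared__ float prevRow[];   // one slot per column in the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (x < width);

    if (active) prevRow[threadIdx.x] = 0.0f;   // nothing above row 0
    for (int y = 0; y < height; ++y) {
        if (active) {
            // time t: img[*][y] + shared_memory[*]
            float sum = img[y * width + x] + prevRow[threadIdx.x];
            img[y * width + x] = sum;
            prevRow[threadIdx.x] = sum;        // store back for time t+1
        }
        __syncthreads();   // keep the whole block on the same row
    }
}
```

Note the guard variable `active` instead of an early `return`: every thread in the block must reach `__syncthreads()`, so threads outside the image still participate in the barrier.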
4. USING REGISTER
• Scope: thread
• One line, one thread
Why not one pixel per thread? That would require __syncthreads();
• A register stores the value of the previous pixel
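The register scheme can be sketched as follows. The kernel name `computeByRow` matches the profiling output later in the slides, but the exact signature and launch configuration here are assumptions; the point is that the running sum lives in a per-thread register, so no `__syncthreads()` is needed.

```cuda
// Sketch of the register variant: one thread scans one full row,
// carrying the previous pixel's accumulated value in a register.
__global__ void computeByRow(float *img, int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;   // safe: no barrier inside this kernel

    float running = 0.0f;      // register: value of the previous pixel
    for (int x = 0; x < width; ++x) {
        running += img[y * width + x];
        img[y * width + x] = running;   // row-wise prefix sum in place
    }
}
```

Because each thread touches only its own row, threads never share data and the serial dependency along the row is resolved entirely in the register.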
5. RESULT
• 16×16 image
• Serial version: 0.006336 ms
• Parallel version: 5.88559e-39 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
55.69    18.91us   1      18.91us   18.91us   18.91us   computeByRow(float*, int, int)
25.84    8.78us    1      8.78us    8.78us    8.78us    computeByColumn(float*, int, int)
12.91    4.38us    2      2.19us    2.18us    2.21us    [CUDA memcpy DtoH]
5.56     1.89us    2      944ns     928ns     960ns     [CUDA memcpy HtoD]
6. RESULT (CONT.)
• 640×480 image
• Serial version: 5.1607 ms
• Parallel version: 4.40496 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
66.37    2.19ms    1      2.19ms    2.19ms    2.19ms    computeByRow(float*, int, int)
12.75    419.74us  2      209.87us  209.28us  210.46us  [CUDA memcpy HtoD]
11.74    386.43us  2      193.22us  191.04us  195.39us  [CUDA memcpy DtoH]
9.15     301.24us  1      301.24us  301.24us  301.24us  computeByColumn(float*, int, int)