WEEKLY REPORT
Thu., Nov 21, 2013
Pin Yi Tsai
OUTLINE
• Current Work
• Compute Integral Image – computeByRow
 Using shared memory
 Measure time in CUDA
 Result
 Conclusion
USING SHARED MEMORY

• Scope: block
• Each thread deals with one row; in every iteration:
 Write to shared memory first
 Read the previous result from shared memory
USING SHARED MEMORY (CONT.)

• Scope: block
• Each thread deals with one row
 Store the result to shared memory
 Write back to global memory at the end
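
The scheme above can be sketched as a CUDA kernel. This is a minimal sketch under stated assumptions: the kernel name computeByRow matches the profiler output later in the report, but the signature, launch configuration, and indexing are guesses, not the author's actual code.

```cuda
// Sketch only: each thread owns one row; the running sum lives in shared
// memory and the finished row is written back to global memory at the end.
__global__ void computeByRow(float *img, int width, int height)
{
    // Assumed launch: blockDim.x rows per block, with
    // blockDim.x * width * sizeof(float) bytes of dynamic shared memory.
    extern __shared__ float rowBuf[];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;

    float *myRow = rowBuf + threadIdx.x * width;

    float sum = 0.0f;
    for (int col = 0; col < width; ++col) {
        sum += img[row * width + col];  // read input from global memory
        myRow[col] = sum;               // keep the prefix sum in shared memory
    }

    // Write the whole row back to global memory in one pass.
    for (int col = 0; col < width; ++col)
        img[row * width + col] = myRow[col];
}
```

Note that consecutive threads touch rows that sit a full width apart in global memory, so these reads and writes are not coalesced; that is consistent with the modest numbers in the results below.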
USING SHARED MEMORY (CONT.)
• Limitation: 49152 bytes (48 KB) of shared memory per block
 Float: 4 bytes
 12288 units / width => X rows per block
• Segment the large image into several parts
 Avoid exceeding the shared-memory limit
USING SHARED MEMORY (CONT.)
• 49152 bytes per block
 Float: 4 bytes
 12288 units / 641 => 19 rows per block
 19 rows per block, 26 segments (height: 481)
TESLA M2050
MEASURE TIME IN CUDA
• cudaThreadSynchronize()
 Similar to the non-deprecated function cudaDeviceSynchronize()
 Returns an error if one of the preceding tasks has failed
• cudaDeviceSynchronize()
 Blocks until the device has completed all preceding requested tasks
• The first one is deprecated because its name does not reflect its behavior
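
Besides synchronizing before a host-side timer, CUDA's event API is a common way to time a kernel on the device itself. A sketch (the kernel name, launch configuration, and variable names are placeholders):

```cuda
// Time a kernel with CUDA events; cudaEventElapsedTime reports milliseconds.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
computeByRow<<<blocks, threads, sharedBytes>>>(d_img, width, height);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed device time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```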
RESULT
• 16x16 (using shared memory with size of full image)
• Serial version: 0.00656 ms
• Parallel version: 0.197344 ms

======== Profiling result:
Time(%)     Time  Calls      Avg      Min      Max  Name
  56.85  19.73us      1  19.73us  19.73us  19.73us  computeByRow(float*, int, int)
  25.17   8.73us      1   8.73us   8.73us   8.73us  computeByColumn(float*, int, int)
  12.54   4.35us      2   2.18us   2.18us   2.18us  [CUDA memcpy DtoH]
   5.44   1.89us      2    944ns    928ns    960ns  [CUDA memcpy HtoD]
RESULT (CONT.)
• 640*480
• Using shared memory per line
• Serial version: 5.11238 ms
• Parallel version: 4.361386 ms

======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  66.36    2.18ms      1    2.18ms    2.18ms    2.18ms  computeByRow(float*, int, int)
  12.72  418.14us      2  209.07us  208.45us  209.70us  [CUDA memcpy HtoD]
  11.75  386.21us      2  193.10us  191.04us  195.17us  [CUDA memcpy DtoH]
   9.16  301.24us      1  301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
RESULT (CONT.)
• 640*480
• Using segmented image and shared memory
• Serial version: 5.11238 ms
• Parallel version: 70.0833 ms

======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  98.22   66.23ms     26    2.55ms    2.55ms    2.55ms  computeByRow(float*, int, int)
   0.69  467.46us     27   17.31us    9.79us  209.76us  [CUDA memcpy HtoD]
   0.64  429.60us     27   15.91us    8.93us  195.58us  [CUDA memcpy DtoH]
   0.45  301.18us      1  301.18us  301.18us  301.18us  computeByColumn(float*, int, int)
CONCLUSION
• The method doesn’t improve the performance

• Find a new method to write the massive data from shared memory to global memory
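
One possible direction for that follow-up (purely a sketch, not the report's method, and all names are hypothetical): have the whole block flush the shared buffer cooperatively, so that consecutive threads write consecutive global addresses (coalesced), instead of each thread serially writing back its own row:

```cuda
// Cooperative, coalesced write-back of a block's shared row buffer.
// rowBuf holds `rows` consecutive image rows of length `width`.
__device__ void flushRowsCoalesced(const float *rowBuf, float *img,
                                   int firstRow, int rows, int width)
{
    int total = rows * width;
    // Block-stride loop over the whole buffer: thread i writes elements
    // i, i + blockDim.x, i + 2*blockDim.x, ... so neighboring threads
    // always hit neighboring global addresses.
    for (int idx = threadIdx.x; idx < total; idx += blockDim.x) {
        int r = idx / width;
        int c = idx % width;
        img[(firstRow + r) * width + c] = rowBuf[idx];
    }
    __syncthreads();   // make sure the buffer is free before it is reused
}
```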
The End
Presented at Delta R621, 15:10~15:58