20131121

Delta R621
15:10~15:58

Published in: Technology

Transcript

  • 1. WEEKLY REPORT Thur., Nov 21, 2013 Pin Yi Tsai
  • 2. OUTLINE • Current Work • Compute Integral Image – computeByRow – Using shared memory – Measure time in CUDA – Result – Conclusion
  • 3. USING SHARED MEMORY • Scope: block • Each thread deals with one row; in every iteration: – Write to shared memory first – Read the previous result from shared memory
  • 4. USING SHARED MEMORY (CONT.) • Scope: block • Each thread deals with one row – Store the result to shared memory – Write back to global memory at the end (see the kernel sketch after the transcript)
  • 5. USING SHARED MEMORY (CONT.) • Limitation: 49152 bytes (48 KB) of shared memory per block – Float: 4 bytes – 12288 floats / width => X rows per block • Segment the large image into several parts – Avoid exceeding the shared-memory limit
  • 6. USING SHARED MEMORY (CONT.) • 49152 bytes per block – Float: 4 bytes – 12288 floats / 641 => 19 rows per block – 19 rows per block, 26 segments (height: 481) (see the sizing sketch after the transcript)
  • 7. TESLA M2050
  • 8. MEASURE TIME IN CUDA • cudaThreadSynchronize() – similar to the non-deprecated function cudaDeviceSynchronize() – returns an error if one of the preceding tasks has failed • cudaDeviceSynchronize() – blocks until the device has completed all preceding requested tasks • The former is deprecated because its name does not reflect its behavior (see the timing sketch after the transcript)
  • 9. RESULT • 16x16 (using shared memory with size of full image) • Serial version: 0.00656 ms • Parallel version: 0.197344 ms
    ======== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    56.85    19.73us   1      19.73us   19.73us   19.73us   computeByRow(float*, int, int)
    25.17    8.73us    1      8.73us    8.73us    8.73us    computeByColumn(float*, int, int)
    12.54    4.35us    2      2.18us    2.18us    2.18us    [CUDA memcpy DtoH]
     5.44    1.89us    2      944ns     928ns     960ns     [CUDA memcpy HtoD]
  • 10. RESULT (CONT.) • 640*480 • Using shared memory per line • Serial version: 5.11238 ms • Parallel version: 4.361386 ms
    ======== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    66.36    2.18ms    1      2.18ms    2.18ms    2.18ms    computeByRow(float*, int, int)
    12.72    418.14us  2      209.07us  208.45us  209.70us  [CUDA memcpy HtoD]
    11.75    386.21us  2      193.10us  191.04us  195.17us  [CUDA memcpy DtoH]
     9.16    301.24us  1      301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
  • 11. RESULT (CONT.) • 640*480 • Using segmented image and shared memory • Serial version: 5.11238 ms • Parallel version: 70.0833 ms
    ======== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    98.22    66.23ms   26     2.55ms    2.55ms    2.55ms    computeByRow(float*, int, int)
     0.69    467.46us  27     17.31us   9.79us    209.76us  [CUDA memcpy HtoD]
     0.64    429.60us  27     15.91us   8.93us    195.58us  [CUDA memcpy DtoH]
     0.45    301.18us  1      301.18us  301.18us  301.18us  computeByColumn(float*, int, int)
  • 12. CONCLUSION • This method does not improve performance • Find a new method for writing the large amount of data from shared memory back to global memory
  • 13. The End
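
Kernel sketch for slides 3–4. This is a minimal sketch of the per-row shared-memory scheme described there; the kernel name and signature match the profiler output (computeByRow(float*, int, int)), but the body, the in-place update of img, and the shared-memory layout are assumptions, since the original source is not included in the slides.

    // Sketch only: one thread per row. Shared memory holds the rows owned by
    // the block while the row prefix sums are built, then the finished rows
    // are written back to global memory (slides 3-4).
    __global__ void computeByRow(float *img, int width, int height)
    {
        extern __shared__ float srows[];              // blockDim.x * width floats per block
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= height) return;

        float *srow = srows + threadIdx.x * width;    // this thread's row in shared memory
        float *grow = img + (size_t)row * width;      // this thread's row in global memory

        srow[0] = grow[0];
        for (int col = 1; col < width; ++col) {
            srow[col] = grow[col];                    // write the current value to shared memory first
            srow[col] += srow[col - 1];               // read the previous prefix sum from shared memory
        }
        for (int col = 0; col < width; ++col)
            grow[col] = srow[col];                    // write the finished row back to global memory at the end
    }

With the slide 6 numbers, each block would hold 19 rows and request 19 * 641 * sizeof(float) ≈ 48 KB of dynamic shared memory.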
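
Sizing sketch for slides 5–6. The arithmetic below reproduces the numbers on those slides on the host side; the variable names are illustrative, and reading 641 x 481 as a 640x480 image with one extra border column/row is an assumption.

    #include <cstdio>

    int main()
    {
        const int sharedBytes  = 49152;                 // 48 KB shared memory per block (Tesla M2050)
        const int floatsPerBlk = sharedBytes / 4;       // 4-byte floats -> 12288 floats
        const int width = 641, height = 481;            // assumed: 640x480 plus one border column/row
        const int rowsPerBlock = floatsPerBlk / width;  // 12288 / 641 = 19 rows per block
        const int numSegments  = (height + rowsPerBlock - 1) / rowsPerBlock;  // 26 segments

        printf("%d rows per block, %d segments\n", rowsPerBlock, numSegments);
        return 0;
    }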
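
Timing sketch for slide 8. A minimal way the kernel times in the results could be bounded on the host side, using CUDA events plus the cudaDeviceSynchronize() call discussed on that slide; the launch line is only a placeholder comment, since grid, block, and buffer names are not given in the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // computeByRow<<<grid, threads, sharedBytes>>>(d_img, width, height);  // kernel under test (placeholder)
        cudaEventRecord(stop);

        // Block until the device has finished all preceding work (slide 8);
        // cudaThreadSynchronize() would behave the same but is deprecated.
        cudaDeviceSynchronize();

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }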
