20131121

Jocelyn

Delta R621 15:10~15:58

WEEKLY REPORT
Thur., Nov 21, 2013
Pin Yi Tsai

OUTLINE
• Current Work
• Compute Integral Image – computeByRow
  ◦ Using shared memory
  ◦ Measure time in CUDA
  ◦ Result
  ◦ Conclusion

USING SHARED MEMORY

• Scope: block
• Each thread deals with one row. In every iteration, it:
  ◦ Writes its result to shared memory first
  ◦ Reads the previous result back from shared memory

USING SHARED MEMORY (CONT.)

• Scope: block
• Each thread deals with one row
  ◦ Stores its results in shared memory
  ◦ Writes them back to global memory at the end
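The row-per-thread scheme above can be sketched on the CPU. This is a minimal serial stand-in, not the kernels from the deck (which are not shown): the names `computeByRowHost`, `computeByColumnHost`, and `integralImage` are hypothetical, and the `scratch` vector only plays the role of the shared-memory buffer.

```cpp
#include <cstddef>
#include <vector>

// Row pass. Each row is independent, which is what lets one GPU thread own
// one row. `scratch` is an assumed stand-in for the shared-memory buffer.
void computeByRowHost(std::vector<float>& img, int width, int height) {
    std::vector<float> scratch(img.size());
    for (int row = 0; row < height; ++row) {        // one "thread" per row
        const std::size_t base = static_cast<std::size_t>(row) * width;
        for (int col = 0; col < width; ++col) {     // one iteration per column
            // Read the previous running sum from the scratch buffer.
            const float prev = (col == 0) ? 0.0f : scratch[base + col - 1];
            // Store the new running sum in the scratch buffer.
            scratch[base + col] = prev + img[base + col];
        }
    }
    // Write the results back to the image ("global memory") at the end.
    img = scratch;
}

// Column pass: accumulate the row sums downwards, column by column.
void computeByColumnHost(std::vector<float>& img, int width, int height) {
    for (int col = 0; col < width; ++col) {
        for (int row = 1; row < height; ++row) {
            const std::size_t cur = static_cast<std::size_t>(row) * width + col;
            img[cur] += img[cur - width];
        }
    }
}

// Integral image = row pass followed by column pass.
std::vector<float> integralImage(std::vector<float> img, int width, int height) {
    computeByRowHost(img, width, height);
    computeByColumnHost(img, width, height);
    return img;
}
```

For a 2x2 input {1, 2, 3, 4}, the row pass gives {1, 3, 3, 7} and the column pass then gives {1, 3, 4, 10}.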

USING SHARED MEMORY (CONT.)
• Limitation: 49152 bytes (48 KB) of shared memory per block
  ◦ Float: 4 bytes
  ◦ 49152 / 4 = 12288 floats; 12288 / width => X rows per block
• Segment a large image into several parts
  ◦ Keeps each part within the shared-memory limit

USING SHARED MEMORY (CONT.)
• 49152 bytes (48 KB) per block
  ◦ Float: 4 bytes
  ◦ 12288 floats / 641 (width) => 19 rows per block
  ◦ 19 rows per block => 26 segments (height: 481)
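The segment arithmetic on this slide can be checked with a short helper. This is a sketch under the slide's numbers (49152 bytes per block, 4-byte floats, a 641x481 buffer); the function names `rowsPerBlock` and `segmentCount` are hypothetical.

```cpp
#include <cstddef>

// Shared-memory budget per block, taken from the slide (48 KB).
constexpr std::size_t kSharedBytesPerBlock = 49152;

// How many full rows of `width` floats fit in the shared-memory budget.
std::size_t rowsPerBlock(std::size_t sharedBytes, std::size_t width) {
    if (width == 0) return 0;
    return (sharedBytes / sizeof(float)) / width;   // rounds down
}

// How many segments are needed to cover `height` rows (rounds up).
// Returns 0 when not even one row fits, so the caller can detect that case.
std::size_t segmentCount(std::size_t height, std::size_t rows) {
    if (rows == 0) return 0;
    return (height + rows - 1) / rows;
}
```

With a width of 641, 19 rows fit (19 × 641 × 4 = 48716 bytes, just under the limit), and a height of 481 then needs 26 segments, which matches the 26 computeByRow calls in the segmented profiling run.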

MEASURE TIME IN CUDA
• Kernel launches are asynchronous, so synchronize before reading the host timer
• cudaThreadSynchronize()
  ◦ Similar to the non-deprecated function cudaDeviceSynchronize()
  ◦ Returns an error if one of the preceding tasks has failed
• cudaDeviceSynchronize()
  ◦ Blocks until the device has completed all preceding requested tasks
• cudaThreadSynchronize() is deprecated because its name does not reflect its behavior

RESULT
• 16x16 image (shared memory sized to hold the full image)

• Serial version: 0.00656 ms
• Parallel version: 0.197344 ms
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  56.85   19.73us      1   19.73us   19.73us   19.73us  computeByRow(float*, int, int)
  25.17    8.73us      1    8.73us    8.73us    8.73us  computeByColumn(float*, int, int)
  12.54    4.35us      2    2.18us    2.18us    2.18us  [CUDA memcpy DtoH]
   5.44    1.89us      2     944ns     928ns     960ns  [CUDA memcpy HtoD]

RESULT (CONT.)
• 640*480
• Using shared memory per line
• Serial version: 5.11238 ms
• Parallel version: 4.361386 ms
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  66.36    2.18ms      1    2.18ms    2.18ms    2.18ms  computeByRow(float*, int, int)
  12.72  418.14us      2  209.07us  208.45us  209.70us  [CUDA memcpy HtoD]
  11.75  386.21us      2  193.10us  191.04us  195.17us  [CUDA memcpy DtoH]
   9.16  301.24us      1  301.24us  301.24us  301.24us  computeByColumn(float*, int, int)

RESULT (CONT.)
• 640*480
• Using a segmented image and shared memory
• Serial version: 5.11238 ms
• Parallel version: 70.0833 ms
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  98.22   66.23ms     26    2.55ms    2.55ms    2.55ms  computeByRow(float*, int, int)
   0.69  467.46us     27   17.31us    9.79us  209.76us  [CUDA memcpy HtoD]
   0.64  429.60us     27   15.91us    8.93us  195.58us  [CUDA memcpy DtoH]
   0.45  301.18us      1  301.18us  301.18us  301.18us  computeByColumn(float*, int, int)
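To compare the three runs, the reported timings can be turned into speedup ratios. This is a small sketch using only the numbers on the result slides; the helper names `speedup` and `totalKernelMs` are hypothetical.

```cpp
// Speedup of the parallel version over the serial one.
// Values above 1 mean the parallel version is faster.
double speedup(double serialMs, double parallelMs) {
    return serialMs / parallelMs;
}

// Total time spent in a kernel launched `calls` times at `avgMs` each.
double totalKernelMs(int calls, double avgMs) {
    return calls * avgMs;
}
```

With the slide values, the per-line version gives `speedup(5.11238, 4.361386)`, roughly 1.17, while the segmented version gives `speedup(5.11238, 70.0833)`, roughly 0.07. `totalKernelMs(26, 2.55)` is about 66.3 ms, so the 26 computeByRow launches account for almost all of the segmented run.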

CONCLUSION
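• Using shared memory does not improve performance overall: for 640*480, the per-line version is only slightly faster than serial (4.361386 ms vs. 5.11238 ms), and the segmented version is much slower (70.0833 ms)
• Next step: find a new method for writing large amounts of data from shared memory back to global memory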



7. TESLA M2050
