SlideShare a Scribd company logo
1 of 13
WEEKLY REPORT
Thur., Nov 21, 2013
Pin Yi Tsai
OUTLINE
• Current Work
• Compute Integral Image – computeByRow
 Using shared memory
 Measure time in CUDA

 Result
 Conclusion
USING SHARED MEMORY

• Scope: block
• Each thread deal with one row, in every iteration:
 Write to shared memory first

 Read the previous result from shared memory
USING SHARED MEMORY (CONT.)

• Scope: block
• Each thread deal with one row
 Store the result to shared memory

 Write back to the global memory in the end
USING SHARED MEMORY (CONT.)
• Limitation: 49152 KB per block

 Float: 4 bytes
 12288 units / width => X rows per block

• Segment the large image to several parts
 Avoid the size exceeding the limitation
USING SHARED MEMORY (CONT.)
• 49152 KB per block

 Float: 4 bytes
 12288 units / 641 => 19 rows per block
 19 rows per block, 26 segments (height: 481)
TESLA M2050
MEASURE TIME IN CUDA
• cudaThreadSynchronize()

 similar to the non-deprecated function cudaDeviceSynchronize()
 returns an error if one of the preceding tasks has failed
• cudaDeviceSynchronize()

 blocks until the device has completed all preceding requested
tasks
• The first one is deprecated because its name does not reflect its
behavior
RESULT
• 16x16 (using shared memory with size of full image)

• Serial version: 0.00656 ms
• Parallel version: 0.197344 ms
======== Profiling result:
Time(%)

Time Calls

Avg

Min

Max Name

56.85 19.73us

1 19.73us 19.73us 19.73us computeByRow(float*, int, int)

25.17

8.73us

1

8.73us

8.73us

8.73us computeByColumn(float*, int, int)

12.54

4.35us

2

2.18us

2.18us

2.18us [CUDA memcpy DtoH]

5.44

1.89us

2

944ns

928ns

960ns [CUDA memcpy HtoD]
RESULT (CONT.)
• 640*480
• Using shared memory per line
• Serial version: 5.11238 ms
• Parallel version: 4.361386 ms
======== Profiling result:
Time(%)

Time Calls

Avg

Min

Max Name

66.36 2.18ms

1 2.18ms 2.18ms 2.18ms computeByRow(float*, int, int)

12.72 418.14us

2 209.07us 208.45us 209.70us [CUDA memcpy HtoD]

11.75 386.21us

2 193.10us 191.04us 195.17us [CUDA memcpy DtoH]

9.16 301.24us

1 301.24us 301.24us 301.24us computeByColumn(float*, int, int)
RESULT (CONT.)
• 640*480
• Using segment image and shared memory
• Serial version: 5.11238 ms
• Parallel version: 70.0833 ms
======== Profiling result:

Time(%)

Time Calls

Avg

Min

Max Name

98.22 66.23ms

26 2.55ms 2.55ms 2.55ms computeByRow(float*, int, int)

0.69 467.46us

27 17.31us 9.79us 209.76us [CUDA memcpy HtoD]

0.64 429.60us

27 15.91us 8.93us 195.58us [CUDA memcpy DtoH]

0.45 301.18us

1 301.18us 301.18us 301.18us computeByColumn(float*, int, int)
CONCLUSION
• The method doesn’t improve the performance

• Find the new method to write the massive data from shared memory to
the global memory
The End

More Related Content

What's hot

Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
FlameWorks GTC 2014
FlameWorks GTC 2014FlameWorks GTC 2014
FlameWorks GTC 2014Simon Green
 
Parallel implementation of geodesic distance transform with application in su...
Parallel implementation of geodesic distance transform with application in su...Parallel implementation of geodesic distance transform with application in su...
Parallel implementation of geodesic distance transform with application in su...Tuan Q. Pham
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesNarann29
 
有點硬又不會太硬的DNN加速器
有點硬又不會太硬的DNN加速器有點硬又不會太硬的DNN加速器
有點硬又不會太硬的DNN加速器Rouyun Pan
 
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...Gurbinder Gill
 
Cassandra at talkbits
Cassandra at talkbitsCassandra at talkbits
Cassandra at talkbitsMax Alexejev
 
ES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosSyed Asad Alam
 
Post rendering
Post renderingPost rendering
Post renderingAkilarLiao
 
Network Analysis with networkX : Real-World Example-2
Network Analysis with networkX : Real-World Example-2Network Analysis with networkX : Real-World Example-2
Network Analysis with networkX : Real-World Example-2Kyunghoon Kim
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
Lab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsLab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsRuochun Tzeng
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineNarann29
 

What's hot (20)

Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Thesis Final Presentation
Thesis Final PresentationThesis Final Presentation
Thesis Final Presentation
 
FlameWorks GTC 2014
FlameWorks GTC 2014FlameWorks GTC 2014
FlameWorks GTC 2014
 
Parallel implementation of geodesic distance transform with application in su...
Parallel implementation of geodesic distance transform with application in su...Parallel implementation of geodesic distance transform with application in su...
Parallel implementation of geodesic distance transform with application in su...
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering Techniques
 
有點硬又不會太硬的DNN加速器
有點硬又不會太硬的DNN加速器有點硬又不會太硬的DNN加速器
有點硬又不會太硬的DNN加速器
 
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...
Efficient Variable Size Template Matching Using Fast Normalized Cross Correla...
 
Squeeeze models
Squeeeze modelsSqueeeze models
Squeeeze models
 
Cassandra at talkbits
Cassandra at talkbitsCassandra at talkbits
Cassandra at talkbits
 
GLSL
GLSLGLSL
GLSL
 
ES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_Pos
 
Post rendering
Post renderingPost rendering
Post rendering
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
 
Network Analysis with networkX : Real-World Example-2
Network Analysis with networkX : Real-World Example-2Network Analysis with networkX : Real-World Example-2
Network Analysis with networkX : Real-World Example-2
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
Lab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed SystemsLab: Foundation of Concurrent and Distributed Systems
Lab: Foundation of Concurrent and Distributed Systems
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 

Similar to 20131121

Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...
Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...
Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...Daosheng Mu
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESSubhajit Sahu
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Tokyo Institute of Technology
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocatorsHao-Ran Liu
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Designing and coding Series 40 Java apps for high performance
Designing and coding Series 40 Java apps for high performanceDesigning and coding Series 40 Java apps for high performance
Designing and coding Series 40 Java apps for high performanceMicrosoft Mobile Developer
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.pptceyifo9332
 
Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Universitat Politècnica de Catalunya
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptxSherazMunawar5
 
Speedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql LoaderSpeedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql LoaderGregSmith458515
 
Unity - Internals: memory and performance
Unity - Internals: memory and performanceUnity - Internals: memory and performance
Unity - Internals: memory and performanceCodemotion
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterAttila Szegedi
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programminginside-BigData.com
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a FoeHaim Yadid
 

Similar to 20131121 (20)

20131114
2013111420131114
20131114
 
Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...
Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...
Using The New Flash Stage3D Web Technology To Build Your Own Next 3D Browser ...
 
20131024
2013102420131024
20131024
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTES
 
1083 wang
1083 wang1083 wang
1083 wang
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Designing and coding Series 40 Java apps for high performance
Designing and coding Series 40 Java apps for high performanceDesigning and coding Series 40 Java apps for high performance
Designing and coding Series 40 Java apps for high performance
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...Deep Learning for Computer Vision: Memory usage and computational considerati...
Deep Learning for Computer Vision: Memory usage and computational considerati...
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptx
 
Speedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql LoaderSpeedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql Loader
 
Unity - Internals: memory and performance
Unity - Internals: memory and performanceUnity - Internals: memory and performance
Unity - Internals: memory and performance
 
7_mem_cache.ppt
7_mem_cache.ppt7_mem_cache.ppt
7_mem_cache.ppt
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programming
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
 

Recently uploaded

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

20131121

  • 1. WEEKLY REPORT Thur., Nov 21, 2013 Pin Yi Tsai
  • 2. OUTLINE • Current Work • Compute Integral Image – computeByRow  Using shared memory  Measure time in CUDA  Result  Conclusion
  • 3. USING SHARED MEMORY • Scope: block • Each thread deal with one row, in every iteration:  Write to shared memory first  Read the previous result from shared memory
  • 4. USING SHARED MEMORY (CONT.) • Scope: block • Each thread deal with one row  Store the result to shared memory  Write back to the global memory in the end
  • 5. USING SHARED MEMORY (CONT.) • Limitation: 49152 KB per block  Float: 4 bytes  12288 units / width => X rows per block • Segment the large image to several parts  Avoid the size exceeding the limitation
  • 6. USING SHARED MEMORY (CONT.) • 49152 KB per block  Float: 4 bytes  12288 units / 641 => 19 rows per block  19 rows per block, 26 segments (height: 481)
  • 8. MEASURE TIME IN CUDA • cudaThreadSynchronize()  similar to the non-deprecated function cudaDeviceSynchronize()  returns an error if one of the preceding tasks has failed • cudaDeviceSynchronize()  blocks until the device has completed all preceding requested tasks • The first one is deprecated because its name does not reflect its behavior
  • 9. RESULT • 16x16 (using shared memory with size of full image) • Serial version: 0.00656 ms • Parallel version: 0.197344 ms ======== Profiling result: Time(%) Time Calls Avg Min Max Name 56.85 19.73us 1 19.73us 19.73us 19.73us computeByRow(float*, int, int) 25.17 8.73us 1 8.73us 8.73us 8.73us computeByColumn(float*, int, int) 12.54 4.35us 2 2.18us 2.18us 2.18us [CUDA memcpy DtoH] 5.44 1.89us 2 944ns 928ns 960ns [CUDA memcpy HtoD]
  • 10. RESULT (CONT.) • 640*480 • Using shared memory per line • Serial version: 5.11238 ms • Parallel version: 4.361386 ms ======== Profiling result: Time(%) Time Calls Avg Min Max Name 66.36 2.18ms 1 2.18ms 2.18ms 2.18ms computeByRow(float*, int, int) 12.72 418.14us 2 209.07us 208.45us 209.70us [CUDA memcpy HtoD] 11.75 386.21us 2 193.10us 191.04us 195.17us [CUDA memcpy DtoH] 9.16 301.24us 1 301.24us 301.24us 301.24us computeByColumn(float*, int, int)
  • 11. RESULT (CONT.) • 640*480 • Using segment image and shared memory • Serial version: 5.11238 ms • Parallel version: 70.0833 ms ======== Profiling result: Time(%) Time Calls Avg Min Max Name 98.22 66.23ms 26 2.55ms 2.55ms 2.55ms computeByRow(float*, int, int) 0.69 467.46us 27 17.31us 9.79us 209.76us [CUDA memcpy HtoD] 0.64 429.60us 27 15.91us 8.93us 195.58us [CUDA memcpy DtoH] 0.45 301.18us 1 301.18us 301.18us 301.18us computeByColumn(float*, int, int)
  • 12. CONCLUSION • The method doesn’t improve the performance • Find the new method to write the massive data from shared memory to the global memory