This document discusses using GPUs instead of CPUs for image processing. It notes that GPU peak performance has grown far beyond that of CPUs, from 5,000 triangles/second in 1995 to 350 million triangles/second in 2010. However, GPU programming is more complex than CPU programming because of the different architecture and programming model, which makes algorithms harder to implement on GPUs and harder to optimize for high efficiency. The document proposes a methodology for GPU acceleration that includes characterizing algorithms, estimating performance, using models such as the Roofline model to analyze bottlenecks, and benchmarking. It also describes establishing a competence center to help others overcome the challenges of GPU programming.
1. GPU acceleration of image processing
Jan Lemeire
15/11/2012
2.
3. GPU vs CPU Peak Performance Trends
GPU peak performance has grown aggressively; hardware has kept up with Moore's law.
1995: 5,000 triangles/second (800,000-transistor GPU)
2010: 350 million triangles/second (3-billion-transistor GPU)
Source: NVIDIA
4. To the rescue: Graphical Processing Units (GPUs)
Many-core GPU: 1-3 TeraFlop/second, instead of 10-20 GigaFlop/second for a multi-core CPU
94 fps (AMD Tahiti Pro)
Courtesy: John Owens
Figure 1.1. Enlarging performance gap between GPUs and CPUs.
5.
6. GPUs are an alternative to CPUs in offering processing power
7. Pixel rescaling, lens correction, pattern detection
The CPU gives only 4 fps; next-generation machines need 50 fps.
9. Methodology
Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
GPU implementation
GPU optimization
Hardware
10. Obstacle 1: hard(er) to implement
11. GPU Programming Concepts (OpenCL terminology)
[Diagram: OpenCL hardware model (left) and execution model (right)]
Device/GPU (~1 TFLOPS) contains multiprocessors; a kernel is launched by the host over a grid (1D, 2D or 3D) of work groups.
Each multiprocessor has local memory (16/48 KB, ~40 GB/s, few cycles latency), private memory (16K/8) and scalar processors (~1 GHz).
Global memory (1 GB, ~100 GB/s, ~200 cycles latency), constant memory (64 KB) and texture memory (residing in global memory) are shared by all multiprocessors; the host/CPU RAM connects over a 4-8 GB/s link.
Work items are organized in work groups of size Sx × Sy; a work group is identified by (get_group_id(0), get_group_id(1)), a work item within it by (get_local_id(0), get_local_id(1)), and get_local_size(0)/get_local_size(1) return the work group dimensions.
Max #work items per work group: 1024. Work items are executed in warps/wavefronts of 32/64. Max work groups simultaneously on a multiprocessor: 8; max active warps on a multiprocessor: 24/48.
12. Semi-abstract scalable hardware model
Need to know more details than for the CPU, but code remains compatible/efficient.
Need to know the model to write effective and efficient code.
CPU: the processor ensures efficient execution.
13. Increased code complexity
1. Complex index calculations
Mapping data elements onto processing elements (at least 2 levels)
Sometimes better to group elements
2. Optimizations
Impact on performance needs to be tested
3. A lot of parameters:
a. Algorithm, implementation
b. Configuration of mapping
c. Hardware parameters (limits)
d. Optimized versions
14. Methodology
Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
GPU implementation (via parallelization by compiler, pragma-based, skeleton-based, or OpenCL)
GPU optimization
Hardware
15. Obstacle 2: hard(er) to get efficiency
16. We expect peak performance
A speedup of 100x is possible.
At least, we expect some speedup.
But what is 5x worth?
What are the reasons for low efficiency?
23. Competence Center for Personal Supercomputing
Offer trainings (overcome obstacle 1)
Acquire expertise
Take an independent, critical position
Offer feasibility and performance studies (overcome obstacle 2)
Symposium: Brussels, December 13th 2012
http://parallel.vub.ac.be
Editor's Notes
First, we have to understand where the tremendous computational power of the GPU comes from. The CPU is capable of running any sequential program very fast. The GPU has many processing units, but programming them requires more care. Part of the computational work is mapped onto a processing element and described by a kernel; the kernel is executed by a 'thread'. E.g. in image processing, a pixel is the work unit.
Case of KLA Tencor (ICOS – Leuven): inspection machines needing real-time image processing
Re-implementation of algorithms is required…
On the left the abstract hardware model and on the right the execution model. Both should be understood in order to write OpenCL programs. This contrasts with the simple Von Neumann model used for CPUs.
Our focus is on OpenCL programming and not high-level solutions that generate GPU programs. Those solutions are, in my opinion, not mature yet.
Is 5x worth the effort of porting to GPUs?
Roofline model gives which resource bounds the overall performance
After each waterfall follows calm water, but you have to accept the turbulences first.And you don’t know when you’re out of trouble.