Presentation I gave at the SORT Conference in 2011, generalized from work I had done using GPUs to accelerate image processing at FamilySearch.
GPU programming
The Brick Wall -- UC Berkeley's View
Power Wall: power expensive, transistors free
Memory Wall: memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW
2. Moore’s Law
"The number of transistors incorporated in a chip will approximately double every 24 months."
Gordon Moore, Intel Co-Founder
Originally published in 1965
3.
4. So What’s the Problem?
• Can continue to increase transistors per Moore’s Law
• Cannot continue to increase power or chips will melt
– Power steadily rose with new chips until ~2005 – now 1 volt
• Cannot continue to scale processor frequency
– Have you seen any 10 GHz chips?
Moore’s Law gave no prediction of
continued performance increases
5. Time to “Take the Leap”
“We have reached the limit of what is possible with
one or more traditional, serial central processing
units, or CPUs. It is past time for the computing
industry – and everyone who relies on it for
continued improvements in productivity, economic
growth and social progress – to take the leap into
parallel processing.”
Bill Dally - Chief Scientist at NVIDIA and Professor at Stanford University
http://www.forbes.com/2010/04/29/moores-law-computing-processing-opinions-contributors-bill-dally.html
6. Additional Resources
• Stanford course available on iTunes U
• http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
– Programming Massively Parallel Processors with
CUDA
– Lectures 1 and 13 are great introductions
• Lecture 13 – The Future of Throughput Computing (Bill Dally)
• Lecture 1 – Introduction to Massively Parallel Computing
7. Guiding Principles
• Performance = Parallelism
– Single-threaded processor performance has flat-lined at 0-5% annual growth since ~2005
• Efficiency = Locality
– Chips are power limited with most power spent
moving data around
8. Three Types of Parallelism
• Instruction-level parallelism
– Out of order execution, branch prediction, etc.
– Opportunities decreasing
• Data-level parallelism
– SIMD (Single Instruction Multiple Data), GPUs, etc.
– Opportunities increasing
• Thread-level parallelism
– Multithreading, multi-core CPUs, etc.
– Opportunities increasing
9. Taking the Leap
• Three things are required
– Lots of processors
– Efficient memory storage
– Programming system that abstracts it
10. CPU VS. GPU ARCHITECTURE
CPU
• General purpose processors
• Optimized for instruction level parallelism
• A few large processors capable of multi-threading
GPU
• Special purpose processors
• Optimized for data level parallelism
• Many smaller processors executing single instructions on multiple data (SIMD)
11. High Performance GPU Computing
• GPUs are getting faster more quickly than CPUs
• Being used in industry for weather simulation,
medical imaging, computational finance, etc.
• Amazon is now offering access to NVIDIA Tesla
GPUs in the cloud as a service ($ vs ¢ per hour)
• GPUs are being used as general purpose parallel
processors – http://gpgpu.org
12. Examples
• CUDA – NVIDIA
• C++ AMP – Microsoft
• OpenCL – Open standard (Khronos Group)
• NPP – NVIDIA (Research done at FamilySearch)
13. CUDA
• Compute Unified Device Architecture
• Proprietary NVIDIA extensions to C for
running code on NVIDIA GPUs
• Other language bindings
– Java – jCUDA, JCuda, JCublas, JCufft
– Python – PyCUDA, KappaCUDA
– .NET – CUDAfy.NET, CUDA.NET
– Ruby – KappaCUDA
– More – Fortran, Perl, Mathematica, MATLAB, etc.
14. C for CUDA Example
// Compute vector sum c = a + b
// Each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Allocate and initialize host (CPU) memory
float* hostA = …, *hostB = …;
// Allocate device (GPU) memory
float *deviceA, *deviceB, *deviceC;
cudaMalloc((void**) &deviceA, N * sizeof(float));
cudaMalloc((void**) &deviceB, N * sizeof(float));
cudaMalloc((void**) &deviceC, N * sizeof(float));
// Copy host memory to device
cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);
// Run N/256 blocks of 256 threads each
vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);
}
15. Heterogeneous Computing with
Microsoft C++ AMP
• AMP = Accelerated Massive Parallelism
• Designed to take advantage of all the available compute
resources (CPU, integrated & discrete GPUs)
• Coming in the next version of Visual Studio and C++ in
the next year or two
• Cool demo
http://hothardware.com/News/Microsoft-Demos-C-AMP-Heterogeneous-Computing-at-AFDS/
16. EXAMPLE – C++ AMP
void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
for (int y = 0; y < M; y++) {
for (int x = 0; x < N; x++) {
float sum = 0;
for (int i = 0; i < W; i++)
sum += A[y*W + i] * B[i*N + x];
C[y*N + x] = sum;
}
}
}
void MatrixMult(float* C, const vector<float>&A, const vector<float>&B, int M, int N, int W)
{
array_view<const float, 2> a (M, W, A), b(W, N, B);
array_view<writeonly<float>, 2> c(M, N, C);
parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
float sum = 0;
for (int i = 0; i < a.x; i++)
sum += a(idx.y, i) * b(i, idx.x);
c[idx] = sum;
});
}
17. OpenCL
• Royalty free, cross-platform, vendor neutral
• Managed by Khronos OpenCL working group
(www.khronos.org/opencl)
• Design goal to use all computational resources
– GPUs and CPUs are peers
• Based on C
• Abstract the specifics of underlying hardware
18. Example – OpenCL
void trad_mul(int n, const float *a, const float* b, float* c)
{
for (int i = 0; i < n; i++)
c[i] = a[i] * b[i];
}
kernel void dp_mul(global const float *a, global const float* b, global float* c)
{
int id = get_global_id(0);
c[id] = a[id] * b[id];
} // Execute over "n" work-items
19. Image Processing Flow at FamilySearch
• Image Capture (Uncompressed TIFF) – from microfilm scanners and digital cameras
• Image Post-Processing (DPC)
• Preservation Storage (Lossless JPEG-2000)
• Distribution Storage (JPEG – original size, JPEG – thumbnails)
20. Digital Processing Center (DPC)
• Collection of servers in a data center used by FamilySearch
to continuously process millions of images annually
• Image post processing operations performed include
– Automatic skew correction
– Automatic document cropping
– Image sharpening
– Image scaling (thumbnail creation)
– Encoding into other image formats
• CPU is a current bottleneck (~12 sec/image)
• Processing requirements continuously rising (number of
images, image size and number of color channels)
21. Computer Graphics vs.
Computer Vision
• Approximate inverses of each other:
– Computer graphics – converting “numbers into pictures”
– Computer vision – converting “pictures into numbers”
• GPUs have traditionally been used for computer
graphics – (Ex. Graphics intensive computer games)
• Recent research, hardware and software are using
GPUs for computer vision (Ex. Using Graphics
Devices in Reverse)
• GPUs generally work well when there is ample data-
level parallelism
22. IMPLEMENTATION OPTIONS
Rack Mount Servers
• Several vendors provide solutions. (Ex. One is a 3U rack mount unit capable of holding 16 GPUs connected to 8 servers)
• “Compared to typical quad-core CPUs, Tesla 20 series computing systems deliver equivalent performance at 1/10th the cost and 1/20th the power consumption.” (NVIDIA)
Personal Supercomputer
• GPUs for computing can be placed in a standard workstation. Several vendors provide solutions.
• Each Tesla GPU requires
– Available double-wide PCIe slot
– Two 6-pin or one 8-pin PCIe power connectors and sufficient wattage
– Recommend 4GB RAM per card, at least 2.33 GHz quad-core CPU and 64-bit Linux or Windows
• “250x the computing performance of a standard workstation” (NVIDIA)
23. Image Processing Performance
with IPP and NPP
• FamilySearch currently uses Intel’s IPP
– Intel Performance Primitives
– Optimize operations on Intel CPUs
– Closed source, licensed
• NVIDIA has produced a similar library called NPP
– NVIDIA Performance Primitives
– Optimize operations on NVIDIA GPUs (CUDA underneath)
– Higher level abstraction to perform image processing on GPUs
– No license for SDK
24. EXAMPLE – NPP
// Declare a host object for an 8-bit grayscale image
npp::ImageCPU_8u_C1 hostSrc;
// Load grayscale image from disk
npp::loadImage(sFilename, hostSrc);
// Declare a device image and upload from host
npp::ImageNPP_8u_C1 deviceSrc(hostSrc);
… [Create padded image]
… [Create Gaussian kernel]
// Copy kernel to GPU
cudaMemcpy2D(deviceKernel, 12, hostKernel, kernelSize.width
* sizeof(Npp32s), kernelSize.width * sizeof(Npp32s),
kernelSize.height, cudaMemcpyHostToDevice);
// IPP (CPU):
// Allocate blurred image of appropriate size
Ipp8u* blurredImg = ippiMalloc_8u_C1(img.getWidth(), img.getHeight(), &blurredImgStepSz);
// Perform the filter
ippiFilter32f_8u_C1R(paddedImgData, paddedImage.getStepSize(), blurredImg, blurredImgStepSz, imgSz, kernel, kernelSize, kernelAnchor);
// NPP (GPU):
// Allocate blurred image of appropriate size (on GPU)
npp::ImageNPP_8u_C1 deviceBlurredImg(imgSz.width, imgSz.height);
// Perform the filter
nppiFilter_8u_C1R(paddedImg.data(widthOffset, heightOffset), paddedImg.pitch(), deviceBlurredImg.data(), deviceBlurredImg.pitch(), imgSz, deviceKernel, kernelSize, kernelAnchor, divisor);
// Declare a host image for the result
npp::ImageCPU_8u_C1 hostBlurredImg(deviceBlurredImg.size());
// Copy the device result data into it
deviceBlurredImg.copyTo(hostBlurredImg.data(), hostBlurredImg.pitch());
25. Performance Testing Methodology
• Test System Specifications
– Dual Quad Core Intel® Xeon® 2.80GHz i7 CPUs (8 cores
total)
– 6 GB RAM
– 64-bit Windows 7 operating system
– Single Tesla C1060 Compute Processor (240 processing cores
total)
– PCI-Express x16 Gen2 slot
• Three representative grayscale images of increasing size
– Small image – 1726 x 1450 (2.5 megapixels)
– Average image – 4808 x 3940 (18.9 megapixels)
– Large image – 8966 x 6132 (55.0 megapixels)
• Results for each image repeated 3 times and averaged
• Transfer time to/from the GPU is considered part of all
GPU operations
27. AMDAHL’S LAW
Speeding up 25% of an overall process by 10x is less of an overall improvement than speeding up 75% of it by 1.5x.
28. Takeaways
• Significant performance increases can be realized through
parallelization – may become only way in the future
• GPUs are transforming into general purpose data-parallel
computational coprocessors and outstripping advances in multi-
core CPUs
• Languages, tools and APIs for parallel computing remain relatively
immature, but are improving rapidly
• Relatively small learning curve
– For image processing, NPP’s API nearly perfectly matches Intel’s IPP
– New paradigms around copying to/from GPU and allocating memory
– Can use programming languages familiar to developers without
understanding intricacies of GPU architectures
– Does require rethinking of algorithms to be parallel and building the
computation around the data
Editor's Notes
Don’t claim to be expert
Source of much of what I will present – gives a lot more details, coming from people who know a lot more than I do
Even CPUs realize performance is about parallelism – multi-core CPUs. Power required increases exponentially with distance – Bill Dally says that lots of arithmetic units are actually not hot.
GPUs initially only for computer graphics acceleration
Of course want something that is open
Number of images increasing as is size, more color, etc.
Data center servers for large scale places like FamilySearch; workstations could be put in smaller installations such as an archive. Based on limited survey (most sites don’t list prices): ~$5-6K list price for 1U server or personal supercomputer w/2 Teslas; ~$8-9K list price for 1U server or personal supercomputer w/4 Teslas; ~$1200 per Tesla.
NVIDIA directly going at IPP. Imaging library structured so that we could create an implementation for GPUs to run on a single GPU based server concurrent with current system.
Rotating, cropping, sharpening and scaling operations parallelized on GPU