This document outlines Andreas Klöckner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
These slides compare the runtime of the OpenCV DNN module with our method. They are part of a presentation given at a workshop of the 13th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (MOBIQUITOUS) (http://mobiquitous.org/2016/show/home). Details of the image-recognition portion are omitted.
It's 2021 and containerization has been happening for 7 years already.
In the Java space, there are several ways to package a Java application as a Docker image.
Let's discover them from the Dockerfile to the CNCF Buildpacks, mentioning the Jib way too!
Presentation slides from the CloudNative Days Spring 2021 ONLINE keynote.
https://event.cloudnativedays.jp/cndo2021/talks/1071
This session introduces the basic features of Docker and Kubernetes with illustrations, grounded in how containers actually work. Container runtimes, the low-level software responsible for creating and managing containers, rarely receive much attention, but they are one of the central topics of this session.
This session is based on the author's book "イラストで分かるDockerとKubernetes" (技術評論社).
https://www.amazon.co.jp/dp/4297118378
P2P Container Image Distribution on IPFS With containerd and nerdctl (Kohei Tokunaga)
Talk given at FOSDEM 2022 (February 6, 2022) about IPFS-based P2P image distribution with containerd and nerdctl.
https://fosdem.org/2022/schedule/event/container_ipfs_image/
nerdctl is a Docker-compatible CLI for containerd, developed as a subproject of containerd. nerdctl recently added support for P2P image distribution on IPFS. This enables sharing container images among hosts without hosting, or relying on, a registry.
In this session, Kohei, one of the maintainers of nerdctl, will introduce IPFS-based P2P image distribution with containerd and nerdctl. This session will also show the combination of IPFS-based distribution with the existing image distribution techniques, focusing on lazy pulling (eStargz) and image encryption (OCIcrypt). The status of integration work with other tools including Kubernetes will also be shared.
Related blog post: "P2P Container Image Distribution on IPFS With Containerd": https://medium.com/nttlabs/nerdctl-ipfs-975569520e3d
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources (Shinya Takamaeda-Y)
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources (ReConFig2014@Cancun, Mexico)
flipSyrup, a new framework for rapid prototyping, is proposed.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing (Shinya Takamaeda-Y)
Presentation slide for CARL2013 (Co-located with MICRO-46) at Davis, CA.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing
This deck is from the opening session of the "Introduction to Programming Pascal (P100) with CUDA 8" workshop at CSCS in Lugano, Switzerland. The three-day course is intended to offer an introduction to Pascal (P100) computing using CUDA 8.
Watch the video: http://wp.me/p3RLHQ-gsQ
Learn more: http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=155
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Startup Containers in Lightning Speed with Lazy Image Distribution (Kohei Tokunaga)
Talk given at KubeCon + CloudNativeCon Europe 2020 Virtual about lazy container image distribution technologies, including containerd + Stargz Snapshotter (https://github.com/containerd/stargz-snapshotter).
This is a presentation I gave at the last GPGPU workshop we held in April 2013.
The usage of GPGPU is expanding, creating a continuum from mobile to HPC. At the same time, the question is whether the GPGPU languages are the right ones (well, no), and whether we are wasting resources re-developing the same software stack instead of converging.
Engineering software is widely employed for its powerful abstraction of scientific and technical knowledge. It enables productive applications, e.g., analysis, prototyping, and manufacturing. Making engineering software requires a profound understanding of the problem domain, as well as the art of engineering it.
Software engineering differs substantially from conventional engineering. To build software professionally, mathematicians, scientists, and engineers need skills including system administration, automated builds, automated testing, and version control, to name but a few. Computer-science knowledge such as algorithms and data structures is also indispensable. It is a joyful, interdisciplinary, and world-changing enterprise worth sharing with all future engineering practitioners.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units (AMD Developer Central)
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
This paper discusses algorithms for performing database joins on a GPU. There is some interesting work here that may someday lead to databases implemented on GPGPU platforms such as CUDA.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... (npinto)
Abstract:
Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging: some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography:
Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group, presents the "Vision API Maze: Options and Trade-offs" tutorial at the May 2016 Embedded Vision Summit.
It’s been a busy year in the world of hardware acceleration APIs. Many industry-standard APIs, such as OpenCL and OpenVX, have been upgraded, and the industry has begun to adopt the new generation of low-level, explicit GPU APIs, such as Vulkan, that tightly integrate graphics and compute. Some of these APIs, like OpenVX and OpenCV, are vision-specific, while others, like OpenCL and Vulkan, are general-purpose. Some, like CUDA and Renderscript, are supplier-specific, while others are open standards that any supplier can adopt. Which ones should you use for your project?
In this presentation, Neil Trevett, President of the Khronos Group standards organization, updates the landscape of APIs for vision software development, explaining where each one fits in the development flow. Neil also highlights where these APIs overlap and where they complement each other, and previews some of the latest developments in these APIs.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-trevett
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President at NVIDIA, presents the "APIs for Accelerating Vision and Inferencing: Options and Trade-offs" tutorial at the May 2018 Embedded Vision Summit.
The landscape of SDKs, APIs and file formats for accelerating inferencing and vision applications continues to rapidly evolve. Low-level compute APIs, such as OpenCL, Vulkan and CUDA are being used to accelerate inferencing engines such as OpenVX, CoreML, NNAPI and TensorRT. Inferencing engines are being fed via neural network file formats such as NNEF and ONNX. Some of these APIs, like OpenCV, are vision-specific, while others, like OpenCL, are general-purpose. Some engines, like CoreML and TensorRT, are supplier-specific, while others, such as OpenVX, are open standards that any supplier can adopt. Which ones should you use for your project?
In this presentation, Trevett presents the current landscape of APIs, file formats and SDKs for inferencing and vision acceleration, explaining where each one fits in the development flow. Trevett also highlights where these APIs overlap and where they complement each other, and previews some of the latest developments in these APIs.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/01/khronos-standard-apis-for-accelerating-vision-and-inferencing-a-presentation-from-the-khronos-group/
Neil Trevett, President of the Khronos Group and Vice President of Developer Ecosystems at NVIDIA, presents the “Khronos Standard APIs for Accelerating Vision and Inferencing” tutorial at the September 2020 Embedded Vision Summit.
The landscape of processors and tools for accelerating inferencing and vision applications continues to evolve rapidly. Khronos standards, such as OpenCL, OpenVX, SYCL and NNEF, play an increasingly central role in connecting application developers to the latest silicon—productively, efficiently and portably.
In this talk, Trevett provides an overview and the latest updates on Khronos standards relevant for machine learning and computer vision, and previews how they are likely to evolve in the future.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2015-member-meeting-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of Khronos and Vice President at NVIDIA, delivers the presentation, "Update on Khronos Open Standard APIs for Vision Processing," at the December 2015 Embedded Vision Alliance Member Meeting. Trevett provides an update on recent developments in multiple Khronos standards useful for vision applications.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover a complete schedule of upcoming events, using OpenACC for a biophysics problem, HPC Summit Digital, an overview of the SDSC GPU Hackathon, the OmpSs-2 programming model, new resources and more!
OpenNebulaConf2019 - Welcome and Project Update - Ignacio M. Llorente, Rubén ... (OpenNebula Project)
We've made our way into the world of open cloud — where each organization can find the right cloud for its unique needs. A single cloud management platform cannot be all things to all people. There will be a cloud space with several offerings focused on different environments and/or industries. The OpenNebula commitment to the open cloud is at the very base of its mission — to become the simplest cloud enabling platform — and its purpose — to bring simplicity to the private and hybrid enterprise cloud. OpenNebula exists to help companies build simple, cost-effective, reliable, open enterprise clouds on existing IT infrastructure. The OpenNebula Conference will be a great opportunity to communicate and share our vision and commitment, to look back at how the project has grown in the last 9 years, and to shed some insight into what to expect from the project in the near future.
OpenACC and Open Hackathons Monthly Highlights: July 2022 (OpenACC)
Stay up-to-date with the OpenACC and Open Hackathons Monthly Highlights. July’s edition covers the 2022 OpenACC and Hackathons Summit, NVIDIA’s Applied Research Accelerator Program, upcoming Open Hackathons and Bootcamps, recent research, new resources, and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, the NVIDIA HPC SDK, GCC 10, new resources and more!
HKG15-110: ODP Project Update
---------------------------------------------------
Speaker: Bill Fischofer
Date: February 9, 2015
---------------------------------------------------
★ Session Summary ★
This session provides a summary of ODP activities since LCU '14 and highlights the main features of ODP v1.0 for applications, as well as the validation used by conforming ODP implementations.
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250771
Video: https://www.youtube.com/watch?v=xABcGPOCOuU
Etherpad: http://pad.linaro.org/p/hkg15-110
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2016-member-meeting-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Peter McGuinness, representing the Khronos Group, delivers the presentation "New Standards for Embedded Vision and Neural Networks" at the December 2016 Embedded Vision Alliance Member Meeting. McGuinness discusses new standardization work for embedded neural network and vision software.
ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.
We describe ocl, a Python library built on top of PyOpenCL and numpy. It allows programming GPU devices using Python. Python functions which are marked up using the provided decorator are converted into C99/OpenCL and compiled with the JIT at runtime. This approach lowers the barrier to entry to programming GPU devices, since it requires only Python syntax and no external compilation or linking steps. The resulting Python program runs even if a GPU is not available. As an example application, we solve the problem of computing the covariance matrix for historical stock prices and determining the optimal portfolio according to Modern Portfolio Theory.
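The decorator-driven translation described above can be illustrated with a much-simplified sketch. This is hypothetical illustration code, not ocl's actual API: a library like ocl additionally fetches the decorated function's source via introspection (e.g. inspect.getsource) and handles types, loops, and kernel qualifiers. Here we only turn a one-line scalar function into C99 text:

```python
import ast

def python_to_c99(src):
    # Toy translation of "def f(a, b): return <expr>" into C99 source text.
    # Simple arithmetic expressions read the same in Python and C, so
    # unparsing the return expression is enough for this sketch.
    fdef = ast.parse(src).body[0]                  # the FunctionDef node
    args = [a.arg for a in fdef.args.args]         # parameter names
    expr = ast.unparse(fdef.body[0].value)         # assume a single 'return <expr>'
    params = ", ".join("float " + a for a in args)
    return "float %s(%s) { return %s; }" % (fdef.name, params, expr)

print(python_to_c99("def axpy(a, x, y): return a * x + y"))
# float axpy(float a, float x, float y) { return a * x + y; }
```

Because the marked-up function remains valid Python, the same program can run with or without a GPU, which is how ocl achieves the fallback behavior described above.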
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 Object Recognition and Scene Understanding (Fall 2008)
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
This class will review and discuss current approaches to object recognition and scene understanding in computer vision. The course will cover bag-of-words models, part-based models, classifier-based models, multiclass object recognition and transfer learning, concurrent recognition and segmentation, context models for object recognition, grammars for scene understanding, and large datasets for semi-supervised and unsupervised discovery of object and scene categories. We will be reading a mixture of papers from computer vision and influential works from cognitive psychology on object and scene recognition.
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins), Mike Houston (Stanford) and NVIDIA.
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)
1. Intro PyOpenCL RTCG Perspectives
Easy, Effective, Efficient:
GPU Programming in Python
with PyOpenCL and PyCUDA
Andreas Klöckner
Courant Institute of Mathematical Sciences
New York University
March 31, 2011
Andreas Klöckner: GPU-Python with PyOpenCL and PyCUDA
2. Intro PyOpenCL RTCG Perspectives
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
PyOpenCL, PyCUDA contributors
Nvidia Corp., AMD Corp.
3. Intro PyOpenCL RTCG Perspectives
Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
4. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL
Outline
1 Introduction
A Common Theme
Intro to OpenCL
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
6. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL
How are High-Performance Codes constructed?
“Traditional” Construction of High-Performance Codes:
C/C++/Fortran
Libraries

“Alternative” Construction of High-Performance Codes:
Scripting for ‘brains’
GPUs for ‘inner loops’

Play to the strengths of each programming environment.
8. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL
What is OpenCL?
OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec]

Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
Vendor-neutral
Comes with RTCG

Defines:
Host-side programming interface (library)
Device-side programming language (!)
14. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL
CL vs CUDA side-by-side

CUDA source code:

    __global__ void transpose(
        float *A_t, float *A,
        int a_width, int a_height)
    {
        int base_idx_a =
            blockIdx.x * BLK_SIZE +
            blockIdx.y * A_BLOCK_STRIDE;
        int base_idx_a_t =
            blockIdx.y * BLK_SIZE +
            blockIdx.x * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + threadIdx.x + a_width * threadIdx.y;
        int glob_idx_a_t =
            base_idx_a_t + threadIdx.x + a_height * threadIdx.y;

        __shared__ float A_shared[BLK_SIZE][BLK_SIZE + 1];

        A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
        __syncthreads();
        A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
    }

OpenCL source code:

    __kernel void transpose(
        __global float *a_t, __global float *a,
        unsigned a_width, unsigned a_height)
    {
        int base_idx_a =
            get_group_id(0) * BLK_SIZE +
            get_group_id(1) * A_BLOCK_STRIDE;
        int base_idx_a_t =
            get_group_id(1) * BLK_SIZE +
            get_group_id(0) * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + get_local_id(0) + a_width * get_local_id(1);
        int glob_idx_a_t =
            base_idx_a_t + get_local_id(0) + a_height * get_local_id(1);

        __local float a_local[BLK_SIZE][BLK_SIZE + 1];

        a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
        barrier(CLK_LOCAL_MEM_FENCE);
        a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
    }
Andreas Kl¨ckner
o GPU-Python with PyOpenCL and PyCUDA
OpenCL ↔ CUDA: A dictionary

OpenCL                         CUDA
Grid                           Grid
Work Group                     Block
Work Item                      Thread
__kernel                       __global__
__global                       __device__
__local                        __shared__
__private                      (per-thread) local
image2d_t / image3d_t          texture<type, n, ...>
barrier(CLK_LOCAL_MEM_FENCE)   __syncthreads()
get_local_id(0/1/2)            threadIdx.x/y/z
get_group_id(0/1/2)            blockIdx.x/y/z
get_global_id(0/1/2)           – (reimplement)
OpenCL: Execution Model

[Figure: an nD grid of work groups, each a grid of work items]

Two-tiered Parallelism:
Grid = Nx × Ny × Nz work groups
Work group = Sx × Sy × Sz work items
Total: ∏ over i ∈ {x, y, z} of Si·Ni work items
Comm/Sync only within a work group
Work group maps to compute unit
Grid/Group ≈ outer loops in an algorithm

Device Language:
get_{global,group,local}_{id,size}(axis)
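The relationship between the index functions above can be checked in a few lines of pure Python (no GPU required); this is only a simulation of one axis of the NDRange, illustrating that the global id is the group id times the local size plus the local id:

```python
# Pure-Python sketch of OpenCL's two-tiered index space along one axis:
# global id = group id * local size + local id.
def index_space(num_groups, local_size):
    """Yield (group_id, local_id, global_id) for one axis of an NDRange."""
    for group_id in range(num_groups):
        for local_id in range(local_size):
            yield group_id, local_id, group_id * local_size + local_id

ids = list(index_space(num_groups=3, local_size=4))
global_ids = [g for _, _, g in ids]
assert global_ids == list(range(12))  # 3 groups x 4 items = 12 work items
```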
OpenCL: Computing as a Service

[Figure: the host (CPU), with its own memory, drives several compute devices, each with its own memory, grouped into platforms:
Platform 0 (e.g. CPUs): Compute Device 0, Compute Device 1
Platform 1 (e.g. GPUs): Compute Device 0, Compute Device 1]

Compute Device: think "chip", has a memory interface
Compute Unit: think "processor", has instruction fetch
Processing Element: think "SIMD lane"

Host Language: Python
Device Language: ∼ C99
OpenCL Object Diagram

[Figure 2.1: OpenCL UML class diagram. Credit: Khronos Group]
Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput
→ they complement each other

CPU: largely restricted to control tasks (∼1000/sec)
→ scripting is fast enough

Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
Outline
1 Introduction
2 Programming with PyOpenCL
  First Contact
  About PyOpenCL
3 Run-Time Code Generation
4 Perspectives
Dive into PyOpenCL

import pyopencl as cl
import numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_global_id(0)] *= 2; }
    """).build()   # "twice" is the compute kernel

prg.twice(queue, a.shape, (1,), a_dev)
Dive into PyOpenCL

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()
import numpy.linalg as la
assert la.norm(result - 2*a) == 0
PyOpenCL: Completeness

PyOpenCL exposes all of OpenCL. For example:
Every GetInfo() query
Images and Samplers
Memory Maps
Profiling and Synchronization
GL Interop
PyOpenCL: Completeness

PyOpenCL supports (nearly) every OS that has an OpenCL implementation:
Linux
OS X
Windows
Automatic Cleanup

Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed (obj.release()).
Correctly deals with multiple contexts and dependencies (based on OpenCL's reference counting).
PyOpenCL: Documentation

[Screenshot of the PyOpenCL reference documentation]
PyOpenCL Philosophy

Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information

http://mathema.tician.de/software/pyopencl (or /pycuda)
Complete documentation
X Consortium License (no warranty, free for all use)
Convenient abstractions: Arrays, Elementwise op., Reduction, Scan
Requires: numpy, Python 2.4+ (Win/OS X/Linux)
Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies

B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E, F)
P = p(B)
Q = q(B)
R = r(G, P, Q)

[Figure: the resulting dependency DAG from A down to R]
Capturing Dependencies

B = f(A), C = g(B), E = f(C), F = h(C), G = g(E, F), P = p(B), Q = q(B), R = r(G, P, Q)

Switch the queue to out-of-order mode!
Specify dependencies as a list of events, using the optional wait_for keyword argument to enqueue_XXX calls.
Can also enqueue a barrier.
Common use case: transmit/receive from other MPI ranks.
Possible on Nvidia Fermi: submit parallel work to increase machine use.
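The event lists passed via wait_for express exactly the dependency DAG on this slide. A minimal pure-Python sketch (no OpenCL needed, helper names are illustrative) of scheduling that DAG so every node runs only after the events it waits on:

```python
# Sketch: order the slide's dependency DAG so that each computation
# runs only after everything it waits on, as wait_for event lists would.
deps = {  # node -> nodes it waits on
    "B": ["A"], "C": ["B"], "E": ["C"], "F": ["C"],
    "G": ["E", "F"], "P": ["B"], "Q": ["B"], "R": ["G", "P", "Q"],
}

def schedule(deps):
    """Return one valid execution order (Kahn-style topological sort)."""
    done, order = {"A"}, []
    pending = dict(deps)
    while pending:
        ready = [n for n, ds in pending.items() if all(d in done for d in ds)]
        for n in sorted(ready):  # sorted only to make the order deterministic
            order.append(n)
            done.add(n)
            del pending[n]
    return order

order = schedule(deps)
assert order.index("R") == len(order) - 1   # R runs last
assert order.index("C") < order.index("E")  # E waits on C
```

An out-of-order queue is free to pick any order consistent with these constraints, which is what enables overlapping independent work.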
Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
  The Idea
  RTCG in Action
4 Perspectives
Metaprogramming

Idea
In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: Code is data; it wants to be reasoned about at run time.)

[Figure: Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result. The GPU code may come from a human or be machine-generated; PyCUDA and PyOpenCL are good for this kind of code generation.]
Machine-generated Code

Why machine-generate code?
Automated tuning (cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants are faster than variables (→ register pressure)
Loop unrolling
PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:
Simple %-operator substitution
  (combine with the C preprocessor: simple, often sufficient)
Use a templating engine (Mako works very well)
codepy:
  Build C syntax trees from Python
  Generates readable, indented C

Many ways of evaluating code; the most important one:
Exact device timing via events
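The first approach, %-operator substitution, needs no GPU to demonstrate; the generated string is what would later be handed to cl.Program. A small sketch (the kernel name and operation below are made up for illustration):

```python
# Sketch of %-operator RTCG: build OpenCL C source as a Python string.
# Only the string handling is shown; compiling it would require pyopencl.
template = """
__kernel void %(name)s(__global float *x)
{ x[get_global_id(0)] = %(operation)s; }
"""

source = template % {
    "name": "scale",
    "operation": "x[get_global_id(0)] * 2.0f",
}

assert "__kernel void scale" in source
assert "2.0f" in source
```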
PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program? Abstraction is good:

import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu
pyopencl.array: Simple Linear Algebra

pyopencl.array.Array:
Meant to look and feel just like numpy.
  cl_array.to_device(ctx, queue, numpy_array)
  numpy_array = ary.get()
+, -, *, /, fill, sin, arange, exp, rand, ...
Mixed types (int32 + float32 = float64)
print the array for debugging.
Allows access to raw bits:
  Use as kernel arguments, memory maps
pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

n = 10000
a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))
b_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))

from pyopencl.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(ctx,
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
RTCG via Substitution

source = ("""
__kernel void %(name)s(%(arguments)s)
{
  unsigned lid = get_local_id(0);
  unsigned gsize = get_global_size(0);
  unsigned work_item_start = get_local_size(0)*get_group_id(0);
  for (unsigned i = work_item_start + lid; i < n; i += gsize)
  {
    %(operation)s;
  }
}
""" % {
    "arguments": ", ".join(arg.declarator() for arg in arguments),
    "operation": operation,
    "name": name,
    "loop_prep": loop_prep,
})

prg = cl.Program(ctx, source).build()
RTCG via Templates

from mako.template import Template

tpl = Template("""
__kernel void add(
  __global ${type_name} *tgt,
  __global const ${type_name} *op1,
  __global const ${type_name} *op2)
{
  int idx = get_local_id(0)
    + ${local_size} * ${thread_strides}
    * get_group_id(0);

  % for i in range(thread_strides):
      <% offset = i*local_size %>
      tgt[idx + ${offset}] =
        op1[idx + ${offset}]
        + op2[idx + ${offset}];
  % endfor
}""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)

knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction made easy

Example: A dot product calculation

from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="__global const float *x, __global const float *y")

import pyopencl.clrandom as cl_rand
x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
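The reference semantics of neutral, map_expr and reduce_expr above can be spelled out in a few lines of plain Python: starting from the neutral element, the reduce expression folds together the mapped values, which for these expressions is just a dot product.

```python
# Reference semantics of the ReductionKernel above: neutral="0",
# map_expr="x[i]*y[i]", reduce_expr="a+b"  =>  a plain dot product.
def dot_reference(xs, ys):
    acc = 0  # the "neutral" element
    for x, y in zip(xs, ys):
        acc = acc + x * y  # reduce_expr applied to the mapped value
    return acc

assert dot_reference([1, 2, 3], [4, 5, 6]) == 32
```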
pyopencl.scan: Scan made easy

Example: A cumulative sum computation

from pyopencl.scan import InclusiveScanKernel
knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
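The semantics of an inclusive scan with operator "a+b" can be checked against a pure-Python reference (no GPU needed): each output element is the running total up to and including that position.

```python
# Reference semantics of an inclusive scan with operator a+b.
def inclusive_scan(seq):
    out, running = [], 0
    for v in seq:
        running += v
        out.append(running)
    return out

assert inclusive_scan([3, 1, 4, 1, 5]) == [3, 4, 8, 9, 14]
```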
Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
  PyCUDA
  DG-FEM on the GPU
  "Automatic" GPU Programming
  Conclusions
Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your appetite

mod = pycuda.compiler.SourceModule("""
__global__ void twice(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")   # "twice" is the compute kernel

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
PyOpenCL ↔ PyCUDA: A (rough) dictionary

PyOpenCL                      PyCUDA
Context                       Context
CommandQueue                  Stream
Buffer                        mem_alloc / DeviceAllocation
Program                       SourceModule
Kernel                        Function
Event (e.g. enqueue_marker)   Event
Discontinuous Galerkin Method

Let $\Omega := \bigcup_i D_k \subset \mathbb{R}^d$.

Goal
Solve a conservation law on $\Omega$: $u_t + \nabla \cdot F(u) = 0$

Example
Maxwell's Equations: EM field $E(x,t)$, $H(x,t)$ on $\Omega$ governed by
$$\partial_t E - \frac{1}{\varepsilon}\nabla\times H = -\frac{j}{\varepsilon}, \qquad \partial_t H + \frac{1}{\mu}\nabla\times E = 0,$$
$$\nabla\cdot E = \frac{\rho}{\varepsilon}, \qquad \nabla\cdot H = 0.$$
Metaprogramming DG: Flux Terms

$$0 = \int_{D_k} u_t \varphi + [\nabla \cdot F(u)]\varphi \, dx - \int_{\partial D_k} [\hat{n} \cdot F - (\hat{n} \cdot F)^*]\varphi \, dS_x$$

The boundary integral is the flux term. Flux terms:
vary by problem
expression specified by user
evaluated pointwise
Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell's Equations

$$\hat{n} \cdot (F - F^*)_E := \frac{1}{2}\,\hat{n} \times \big([H] - \alpha\, \hat{n} \times [E]\big)$$

where $[\,\cdot\,]$ denotes the interface jump (interior minus exterior trace).

User writes: a vectorial statement in mathematical notation

flux = 1/2*cross(normal, h.int - h.ext
    - alpha*cross(normal, e.int - e.ext))
Metaprogramming DG: Flux Terms Example

We generate: a scalar evaluator in C (6×)

a_flux += (
    (((val_a_field5 - val_b_field5)*fpair->normal[2]
      - (val_a_field4 - val_b_field4)*fpair->normal[0])
     + val_a_field0 - val_b_field0)*fpair->normal[0]
  - (((val_a_field4 - val_b_field4)*fpair->normal[1]
      - (val_a_field1 - val_b_field1)*fpair->normal[2])
     + val_a_field3 - val_b_field3)*fpair->normal[1]
  )*value_type(0.5);
Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load preparation.
Question: How should one assign work to threads?

w_s: in sequence (amortize preparation)
w_i: "inline-parallel" (exploit register space)
w_p: in parallel

[Figure: three thread/time layouts illustrating the w_s, w_i and w_p strategies]
Loop Slicing for Differentiation

[Figure: execution time [ms] for local differentiation, matrix-in-shared, order 4, with microblocking, as a function of w_p (15 to 30) and w_s; point size denotes w_i ∈ {1, ..., 4}]
Nvidia GTX 280 vs. single core of Intel Core 2 Duo E8400

[Figure: GFlops/s (0 to 300) vs. polynomial order N (0 to 10) for GPU and CPU; the GPU curve lies far above the CPU curve]
Memory Bandwidth on a GTX 280

[Figure: global memory bandwidth [GB/s] vs. polynomial order N (1 to 9) for the Gather, Lift, Diff and Assy. stages, compared against peak bandwidth]
GPU DG Showcase

Electromagnetism
Poisson
CFD
Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.

One way: smart compilers
  GPU programming requires complex tradeoffs
  Tradeoffs require heuristics
  Heuristics are fragile

Another way: dumb enumeration
  Enumerate loop slicings
  Enumerate prefetch options
  Choose by running the resulting code on actual hardware
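The "dumb enumeration" idea reduces to a very small loop: generate every candidate configuration and keep the one with the best measured time. A pure-Python sketch with a stand-in cost function (a real tuner would compile and time each generated kernel on the device):

```python
# Sketch of "dumb enumeration": try every candidate configuration and
# keep the one with the best measured run time. fake_measure is a
# stand-in; in practice one would time the generated kernel via events.
import itertools

def fake_measure(cfg):
    """Hypothetical cost model standing in for a real device timing."""
    threads, prefetch = cfg
    return abs(threads - 192) + (0 if prefetch else 10)

candidates = itertools.product([64, 128, 192, 256], [True, False])
best = min(candidates, key=fake_measure)
assert best == (192, True)
```

The strength of the approach is that no heuristic has to predict the hardware; the hardware itself picks the winner.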
Loo.py Example

Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500

k = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
  ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
  ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
}

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status

Limited scope:
Requires input/output separation
Kernels must be expressible using the "loopy" model
  (i.e. indices decompose into "output" and "reduction")
Enough for DG, LA, FD, ...

Kernel compilation limits trial rate
Non-goal: peak performance
Good results currently for dense linear algebra and (some) DG subkernels
Where to from here?

PyCUDA, PyOpenCL, hedge
→ http://www.cims.nyu.edu/~kloeckner/

GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.

GPU-DG article: AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comp. Phys., 228 (21), 7863-7882.

Also: intro in GPU Computing Gems Vol. 2
Conclusions

GPUs to me: an architecture choice now widely available
Fun time to be in computational science
GPUs and scripting work surprisingly well together:
  Exploit a natural task decomposition in computational codes
RTCG: a crucial tool
GPU scripting is great for prototyping
  ... and just as suitable for production code
Questions?
?
Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
image credits
Image Credits
Dictionary: sxc.hu/topfer
C870 GPU: Nvidia Corp.
OpenCL Logo: Apple Corp./Ars Technica
OS Platforms: flickr.com/aOliN.Tk
Old Books: flickr.com/ppdigital
Floppy disk: flickr.com/ethanhein
Machine: flickr.com/13521837@N00
Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs
[Figure: “Flop Rates: 16 GPUs vs 64 CPU cores”; GFlops/s vs. polynomial order N, comparing GPU and CPU]
Outline
5 OpenCL implementations
The Nvidia CL implementation
Targets only GPUs
Notes:
Nearly identical to CUDA
No native C-level JIT in CUDA (→ PyCUDA)
Page-locked memory: use CL_MEM_ALLOC_HOST_PTR. Careful: double meaning. Page-locked memory is needed for genuinely overlapped transfers.
No linear memory texturing
CUDA device emulation mode deprecated → use AMD CPU CL (faster, too!)
The Apple CL implementation
Targets CPUs and GPUs
General notes:
Different header name: OpenCL/cl.h instead of CL/cl.h
Use -framework OpenCL for C access.
Beware of imperfect compiler cache implementation (ignores include files)
CPU notes:
One work item per processor
GPU similar to hardware vendor implementation.
(New: Intel w/ Sandy Bridge)
The AMD CL implementation
Targets CPUs and GPUs (from both AMD and Nvidia)
GPU notes:
Wide SIMD groups (64)
Native 4/5-wide vectors
But: very flop-heavy machine; may ignore vectors for memory-bound workloads
→ Both implicit and explicit SIMD
CPU notes:
Many work items per processor (emulated)
General:
cl_amd_printf
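The implicit/explicit SIMD distinction can be illustrated by generating either scalar or float4 kernel source at run time; the kernel name and the scaling operation are invented for this sketch.

```python
def make_scale_source(explicit_simd: bool) -> str:
    """Return OpenCL C source for y[i] *= a, scalar or float4-wide."""
    elem = "float4" if explicit_simd else "float"
    # With explicit SIMD, each work item handles a whole float4,
    # matching AMD's native 4/5-wide vector units; the scalar variant
    # leaves any vectorization to the implementation (implicit SIMD).
    return f"""
__kernel void scale(__global {elem} *y, float a)
{{
    int i = get_global_id(0);
    y[i] = a * y[i];
}}
"""
```

For memory-bound workloads on a flop-heavy GPU, the scalar variant may perform just as well, which is the slide's point about ignoring vectors.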