From my talk at the Data & AI Summit: the latest update on the PyTorch Profiler and how you can use it to optimize models for efficiency. The talk also dives into the future and what we need to do together as an industry to move towards Sustainable AI.
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware (Linaro)
The document discusses power management in ARMv8-A and the integration of OP-TEE with the ARM Trusted Firmware. It provides an overview of the software stack and PSCI requirements. It then describes OP-TEE's system view and how it integrates with ARM Trusted Firmware as a runtime service. Finally, it discusses the programmer's view of PSCI and provides examples of how CPU_ON, CPU_OFF, and CPU_SUSPEND operations are handled between Linux, ARM Trusted Firmware, and OP-TEE.
SFO15-205: OP-TEE Content Decryption with Microsoft PlayReady on ARM (Linaro)
SFO15-205: OP-TEE Content Decryption with Microsoft PlayReady on ARM
Speakers: Zoltan Kuscsik
Date: September 22, 2015
★ Session Description ★
This presentation gives an overview of how various components of set-top software are integrated to provide a W3C EME solution employing a commercial DRM integrated with an open source TEE running on ARM TrustZone.
★ Resources ★
Video: https://www.youtube.com/watch?v=defbtpsw6h8
Presentation: http://www.slideshare.net/linaroorg/sfo15205-optee-content-decryption-with-microsoft-playready-on-arm-53111683
Etherpad: pad.linaro.org/p/sfo15-205
Pathable: https://sfo15.pathable.com/meetings/302837
★ Event Details ★
Linaro Connect San Francisco 2015 - #SFO15
September 21-25, 2015
Hyatt Regency Hotel
http://www.linaro.org
http://connect.linaro.org
BKK16-201 Play Ready OPTEE Integration with Secure Video Path lhg-1 (Linaro)
This presentation provides a current view of the security work performed in LHG. The focus is on hardware-protected DRM integrated with OP-TEE, creation of a Secure Data Path coupled with the Open Content Decryption Module, and the lessons learned from integrating third-party libraries into trusted applications.
The second part of Linux Internals covers system calls, the process subsystem, and inter-process communication mechanisms. Understanding these services provided by Linux is essential for embedded systems engineers.
The document provides step-by-step instructions for building and running Intel DPDK sample applications on a test environment with 3 virtual machines connected by 10G NICs. It describes compiling and running the helloworld, L2 forwarding, and L3 forwarding applications, as well as using the pktgen tool for packet generation between VMs to test forwarding performance. Key steps include preparing the Linux kernel for DPDK, compiling applications, configuring ports and MAC addresses, and observing packet drops to identify performance bottlenecks.
U-boot provides a multistage boot process that initializes the CPU and board resources incrementally at each stage. It begins execution on the CPU in a limited environment and hands off to subsequent stages that gain access to more resources like memory and devices. U-boot supports booting an operating system image from storage like SSD or over the network and offers features like secure boot and hypervisor support.
LCU13: Deep Dive into ARM Trusted Firmware
Resource: LCU13
Name: Deep Dive into ARM Trusted Firmware
Date: 31-10-2013
Speaker: Dan Handley / Charles Garcia-Tobin
This third part of Linux internals covers thread programming and using various synchronization mechanisms like mutexes and semaphores. These constructs help users write efficient programs in a Linux environment.
DPDK is a set of drivers and libraries that allow applications to bypass the Linux kernel and access network interface cards directly for very high performance packet processing. It is commonly used for software routers, switches, and other network applications. DPDK can achieve over 11 times higher packet forwarding rates than applications using the Linux kernel network stack alone. While it provides best-in-class performance, DPDK also has disadvantages like reduced security and isolation from standard Linux services.
nftables - the evolution of Linux Firewall (Marian Marinov)
This document provides an overview of nftables, the new packet filtering framework that replaces iptables in the Linux kernel. It discusses the history and predecessors to nftables, how nftables works, key differences from iptables like its more flexible table and chain configuration, and examples of basic nftables rulesets. It also covers topics like matches, jumps, load balancing performance, and kernel configuration options for nftables.
SFO15-302: Energy Aware Scheduling: Progress Update (Linaro)
1. The document discusses introducing generic energy-awareness into the upstream Linux kernel through a new Energy Aware Scheduling (EAS) framework.
2. EAS aims to provide a common solution for scheduling tasks across different CPU topologies in a way that minimizes energy usage based on an energy model of the hardware.
3. The EAS framework integrates idle state, DVFS, and big.LITTLE support and is designed to be clean, based on measurable energy data, and support future CPU topologies with reduced software maintenance costs.
1. DPDK achieves high throughput packet processing on commodity hardware by reducing kernel overhead through techniques like polling, huge pages, and userspace drivers.
2. In Linux, packet processing involves expensive operations like system calls, interrupts, and data copying between kernel and userspace. DPDK avoids these by doing all packet processing in userspace.
3. DPDK uses techniques like isolating cores for packet I/O threads, lockless ring buffers, and NUMA awareness to further optimize performance. It can achieve throughput of over 14 million packets per second on 10GbE interfaces.
This document discusses XDP (eXpress Data Path), a high-performance network data path that allows programs to run on the receive path of a network interface card. XDP enables packet processing using eBPF programs before packets reach the Linux networking stack. The document provides an overview of XDP and its performance advantages over other packet processing methods. It also discusses XDP's current status and support in the Linux kernel as well as example use cases and benchmarks.
Embedded linux system development (slides) (Jaime Barragan)
This document provides an introduction to embedded Linux system development. It discusses Free Electrons, an engineering company focused on embedded Linux, the Linux kernel, and Android. It outlines the hardware that will be used in the training session, including Atmel SAMA5D3 Xplained boards. It provides guidelines for participating in lectures and practical labs.
LCU14-103: How to create and run Trusted Applications on OP-TEE (Linaro)
LCU14-103: How to create and run Trusted Applications on OP-TEE
---------------------------------------------------
Speaker: Joakim Bech
Date: September 15, 2014
---------------------------------------------------
Coresight is the name given to a set of IP blocks providing hardware assisted tracing for ARM based SoCs. This presentation will give an introduction to the technology, how it works and offer a glimpse of the capabilities it offers. More specifically we will go over the components that are part of the architecture and how they are used. Next will be presented the framework Linaro is working on in an effort to provide consolidation and standardization of interfaces to the coresight subsystem. We will conclude with a status of our current upstreaming efforts and how we see the coming months unfolding.
---------------------------------------------------
★ Resources ★
Zerista: http://lcu14.zerista.com/event/member/137703
Google Event: https://plus.google.com/u/0/events/cvb85kqv10dsc4k3e0hcvbr6i58
Presentation: http://www.slideshare.net/linaroorg/lcu14-101-coresight-overview
Video: https://www.youtube.com/watch?v=IQhbM55F23U&list=UUIVqQKxCyQLJS6xvSmfndLA
Etherpad: http://pad.linaro.org/p/lcu14-101
---------------------------------------------------
★ Event Details ★
Linaro Connect USA - #LCU14
September 15-19th, 2014
Hyatt Regency San Francisco Airport
---------------------------------------------------
Here are some useful GDB commands for debugging:
- break <function> - Set a breakpoint at a function
- break <file:line> - Set a breakpoint at a line in a file
- run - Start program execution
- next/n - Step over to next line, stepping over function calls
- step/s - Step into function calls
- finish - Step out of current function
- print/p <variable> - Print value of a variable
- backtrace/bt - Print the call stack
- info breakpoints/ib - List breakpoints
- delete <breakpoint#> - Delete a breakpoint
- layout src - Switch layout to source code view
- layout asm - Switch layout to assembly view
AMD has been away from the HPC space for a while, but now they are coming back in a big way with an open software approach to GPU computing. The Radeon Open Compute Platform (ROCm) was born from the Boltzmann Initiative announced last year at SC15. Now available on GitHub, the ROCm Platform brings a rich foundation to advanced computing by better integrating the CPU and GPU to solve real-world problems.
"We are excited to present ROCm, the first open-source HPC/ultrascale-class platform for GPU computing that’s also programming-language independent. We are bringing the UNIX philosophy of choice, minimalism and modular software development to GPU computing. The new ROCm foundation lets you choose or even develop tools and a language run time for your application."
Watch the video presentation: http://wp.me/p3RLHQ-fJT
Learn more: https://radeonopencompute.github.io/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
SFO15-200: Linux kernel generic TEE driver
Speaker: Jens Wiklander
Date: September 22, 2015
★ Session Description ★
At this session we will learn more about the TEE driver that Linaro has been working on for the last couple of months. Questions to be answered include: What are the APIs? How does the TEE driver work as a communication channel? What will a developer need to think of when adding support for another TEE solution?
★ Resources ★
Video: https://www.youtube.com/watch?v=BhLndLUQamM
Presentation: http://www.slideshare.net/linaroorg/sfo15200-linux-kernel-generic-tee-driver
Etherpad: pad.linaro.org/p/sfo15-200
Pathable: https://sfo15.pathable.com/meetings/302831
★ Event Details ★
Linaro Connect San Francisco 2015 - #SFO15
September 21-25, 2015
Hyatt Regency Hotel
http://www.linaro.org
http://connect.linaro.org
1) Verified boot is the process of assuring users of the integrity of software running on a device by reducing risks from malware and preventing rollbacks to vulnerable past versions. It uses hashing, public key cryptography, and tamper-evident storage.
2) Android Verified Boot 2.0 (AVB) is Google's recommended method for verified boot integration. It uses a signed VBMeta structure containing hashes, hashtrees and rollback indexes to verify the integrity of partitions before booting.
3) AVB supports features like A/B partitions, locked/unlocked device states, and delegates verification authority through chained partitions. It interacts with bootloaders, uses avbtool to generate signatures,
HKG18-411 - Introduction to OpenAMP which is an open source solution for hete... (Linaro)
Session ID: HKG18-411
Session Name: HKG18-411 - Introduction to OpenAMP which is an open source solution for heterogeneous system orchestration and communication
Speaker: Wendy Liang
Track: IoT, Embedded
★ Session Summary ★
Introduction to OpenAMP which is an open source solution for heterogeneous system orchestration and communication
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/hkg18/hkg18-411/
Presentation: http://connect.linaro.org.s3.amazonaws.com/hkg18/presentations/hkg18-411.pdf
Video: http://connect.linaro.org.s3.amazonaws.com/hkg18/videos/hkg18-411.mp4
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2018 (HKG18)
19-23 March 2018
Regal Airport Hotel Hong Kong
---------------------------------------------------
Keyword: IoT, Embedded
http://www.linaro.org
http://connect.linaro.org
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961
"Session ID: BUD17-400
Session Name: Secure Data Path with OPTEE - BUD17-400
Speaker: Mark Gregotski
Track: LHG
★ Session Summary ★
LHG is using the ION-based secure memory allocator integrated with OPTEE as the basis for a secure data path processing pipeline. LHG is following the W3C EME protocol and supporting Content Decryption Modules (CDMs) from Widevine and PlayReady.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/bud17/bud17-400/
Presentation: https://www.slideshare.net/linaroorg/bud17400-secure-data-path-with-optee
Video: https://youtu.be/6JdzsWZq4Ls
---------------------------------------------------
★ Event Details ★
Linaro Connect Budapest 2017 (BUD17)
6-10 March 2017
Corinthia Hotel, Budapest,
Erzsébet krt. 43-49,
1073 Hungary
---------------------------------------------------
Keyword: LHG, secure-data, OPTEE
http://www.linaro.org
http://connect.linaro.org
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://twitter.com/linaroorg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961
The document discusses the Android audio system initialization process and the creation of playback and recording threads. The audio HAL library is loaded based on the device properties, and the AudioFlinger service initializes and manages the audio streams. It creates a MixerThread for playback using the audio HAL output, and a RecordThread is generated for audio input using the HAL functions.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/12/making-edge-ai-inference-programming-easier-and-flexible-a-presentation-from-texas-instruments/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Manisha Agrawal, Product Marketing Engineer at Texas Instruments, presents the “Making Edge AI Inference Programming Easier and Flexible” tutorial at the September 2020 Embedded Vision Summit.
Deploying an AI model at the edge doesn’t have to be challenging—but it often is. Embedded processing vendors have unique sets of software tools for deploying models. It takes time and investment to learn to use proprietary tools and to optimize the edge implementation to achieve your desired performance. While embedded vendors are providing proprietary tools for model deployment, the open source community is also advancing to standardize the model deployment process and make it hardware agnostic.
Texas Instruments has adopted open source software frameworks to make model deployment easier and more flexible. In this talk, you will learn about the struggles developers face when deploying models for inference on embedded processors and how TI addresses these critical software development challenges. You will also discover how TI enables faster time-to-market using a flexible open source development approach without the need to compromise performance, accuracy or power requirements.
The document discusses scheduling in Android. It provides an overview of the history of Linux scheduling including the Completely Fair Scheduler (CFS) and scheduling classes. It describes how CPU power management integrates with scheduling and discusses load tracking methods. The document outlines some problems Android has had with the Linux scheduler and solutions developed. It also summarizes how the Android framework interacts with and provides hints to the underlying Linux scheduler.
Zephyr RTOS in One Hour | HARDWARIO @ IoT North UK (HARDWARIO)
Pavel Hübner (from HARDWARIO) will provide a crash course on the Zephyr RTOS. Zephyr is an innovative operating system targeting 32-bit microcontrollers and is suitable for connected IoT products. Such devices are often low-power and provide a multi-year battery lifespan. The ambitious 60-minute live session will be held on CHESTER, a configurable IoT gateway platform based on the Nordic Semiconductor nRF52840 / nRF9160 SoCs. Throughout the course, Pavel will go from the key Zephyr fundamentals to connecting a fully-fledged IoT application over the NB-IoT network.
A 2015 presentation to introduce users to Java profiling. The Yourkit Profiler is used for concrete examples. The following topics are covered:
1) When to profile
2) Profiler sampling
3) Profiler instrumentation
4) Where to Start
5) Macro vs micro benchmarking
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su... (Intel® Software)
Software AI Accelerators deliver orders of magnitude performance gain for AI across deep learning, classical machine learning, and graph analytics and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.
Scaling Up AI Research to Production with PyTorch and MLFlow (Databricks)
PyTorch, the popular open-source ML framework, has continued to evolve rapidly since the introduction of PyTorch 1.0, which brought an accelerated workflow from research to production.
Reproducible AI using MLflow and PyTorch (Databricks)
Model reproducibility is becoming the next frontier for successful AI models building and deployments for both Research and Production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability and speed up collaboration for AI projects.
When working with big data or complex algorithms, we often look to parallelize our code to optimize runtime. By taking advantage of a GPU's 1000+ cores, a data scientist can quickly scale out solutions inexpensively and sometimes more quickly than using traditional CPU cluster computing. In this webinar, we will present ways to incorporate GPU computing to complete computationally intensive tasks in both Python and R.
See the full presentation here: 👉 https://vimeo.com/153290051
Learn more about the Domino data science platform: https://www.dominodatalab.com
GPU profiling for computer vision applications (Mai Nishimura)
NVIDIA provides several tools for profiling GPU performance of computer vision applications, including nvprof, nvvp, and the next-generation Nsight Compute and Nsight Systems. Nvprof allows command-line profiling with different modes, while nvvp provides a GUI interface for visualizing profiling results. These tools help analyze kernel performance, identify bottlenecks like compute or memory limitations, and optimize applications. TensorFlow also includes a timeline tool for profiling graph execution.
This document discusses porting, scaling, and optimizing applications on Cray XT systems. It covers topics such as choosing compilers, profiling and debugging applications at scale, understanding CPU affinity, and improvements in the Cray Message Passing Toolkit (MPT). The document provides guidance on leveraging different compilers, collecting performance data using hardware counters and CrayPAT, understanding MPI process binding, and enhancements in MPT 4.0 related to MPI standards support and communication optimizations.
Profiling your Applications using the Linux Perf Tools (emBO_Conference)
This document provides an overview of using the Linux perf tools to profile applications. It discusses setting up perf, benchmarking applications, profiling both CPU usage and sleep times, and analyzing profiling data. The document covers perf commands like perf record to collect profiling data, perf report to analyze the data, and perf script to convert it to other formats. It also discusses profiling options like call graphs and collecting kernel vs. user mode events.
Scaling AI in production using PyTorch (geetachauhan)
Slides from my talk at MLOps World' 21
Deploying AI models in production and scaling the ML services is still a big challenge. In this talk we will cover details of how to deploy your AI models, best practices for the deployment scenarios, and techniques for performance optimization and scaling the ML services. Come join us to learn how you can jumpstart the journey of taking your PyTorch models from Research to production.
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches (Marina Kolpakova)
This document provides an overview of pragmatic optimization approaches for modern programming. It discusses understanding code and algorithms, optimizing memory access patterns, minimizing operations, and shrinking critical paths. Hardware-specific optimizations and diving into assembly are also covered. The key lessons are to find and optimize critical code sections, gain knowledge of the code, compiler, and platform, and apply optimizations in an iterative process starting from high-level choices before making low-level changes. Recommended literature on computer architecture, optimization, and compiler engineering is provided.
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
This presentation is for Go developers and operators of Go applications who are interested in reducing costs and latency, or debugging problems such as memory leaks, infinite loops, performance regressions, etc. of such applications. We'll start with a brief description of the unique aspects of the Go runtime, and then take a look at the builtin profilers as well as Go's execution tracer. Additionally we'll look at the interoperability with popular observability tools such as Linux perf and bpftrace. After this presentation you should have a good idea of the various tools you can use, and which ones might be the most useful to you in a production environment.
With the introduction of FPGAs in the cloud, there is an increasing need for solutions able to accelerate traditional CPU code with minimum burden on the user, while retaining competitive performance. In this presentation, we illustrate OXiGen, a tool for the acceleration of dataflow-oriented C applications on FPGA-based systems. The tool offers a complete design flow to optimize C functions into dataflow accelerated kernels and an automated frequency-aware design-space exploration that selects an optimal set of optimizations for the given function. It also allows users to automatically simulate the resulting function by generating a testbench for it. We compare the generated hardware designs against both the respective software implementations and state-of-the-art dataflow designs, reaching comparable performance with a hardware design generated in a few seconds.
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea... (inside-BigData.com)
Machine learning algorithms are increasingly being used across many domains and are affected by technology trends. Deep learning techniques have achieved human-level performance in tasks like speech recognition and face recognition. Training machine learning models requires massive parallelism and computational resources that are well-suited to GPU and multi-core architectures. Reduced precision computation can accelerate training but may impact convergence. Specialized hardware continues to evolve for both training and inference.
This slide will show you how to use SOFA to do performance analysis of CPU/GPU cooperative programs, especially programs running with deep software stacks like TensorFlow, PyTorch, etc.
source code at:
https://github.com/cyliustack/sofa
1032 cs208 g operation system ip camera case share.v0.2 (Stanley Ho)
The document discusses optimizing video performance on Android devices. It begins with an example of using VideoView and MediaPlayer to play video files. It then discusses issues with solely using the ffmpeg software library for video decoding, which can only decode one VGA stream; higher resolutions will lag. Profiling is suggested to find bottlenecks. Potential optimizations mentioned include using ARM/Thumb instructions, concurrent processing, offloading decoding to the GPU, and utilizing hardware video decoding components to meet frame rate targets.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Code GPU with CUDA - Identifying performance limiters (Marina Kolpakova)
This document discusses various techniques for identifying performance limiters in GPU code using CUDA. It recommends timing different parts of code, profiling to collect metrics and events, prototyping kernel parts separately, and benchmarking hardware characteristics. It provides examples of measuring wall time and GPU time. It also lists common profiling events, metrics, and discusses a case study of profiling a matrix transpose. The document emphasizes that profiling helps verify assumptions and identify bottlenecks, but does not replace optimization work.
Postgres Vision 2018: Making Postgres Even Faster (EDB)
Andres Freund, a Senior Database Architect at EnterpriseDB, is one of the leading developers of PostgreSQL and his work has been influential in advancing the replication, performance, and scalability capabilities of Postgres. In this presentation he delivered at Postgres Vision 2018, Freund discusses JIT and general performance enhancements to Postgres and explains why PostgreSQL 11 will be the best option for application developers.
Similar to Profiling PyTorch for Efficiency & Sustainability (20)
Building AI with Security Privacy in Mind (geetachauhan)
The document discusses building AI with security and privacy in mind. It covers privacy challenges in AI like tensions between data privacy and model training. It then discusses various privacy preserving machine learning techniques like homomorphic encryption, differential privacy, secure multi-party computation, on-device computation, and federated learning. The document provides examples of how each technique works. It concludes by discussing tools and techniques for starting a privacy journey in AI and provides resources to learn more.
Building AI with Security and Privacy in mind (geetachauhan)
The document discusses building AI with security and privacy in mind. It covers privacy challenges in AI like tensions between data privacy and model training. It then discusses various privacy preserving machine learning techniques like homomorphic encryption, differential privacy, secure multi-party computation, on-device computation, and federated learning. The document provides examples of how each technique works. It concludes by discussing tools and techniques for starting a privacy journey in AI and provides resources to learn more.
Building Interpretable & Secure AI Systems using PyTorch (geetachauhan)
Slides from my talk at Deep Learning World 2020. The talk covered use cases, special challenges, and solutions for building Interpretable and Secure AI systems using PyTorch.
- Tools for building Interpretable models
- How to build secure, privacy preserving AI models with Pytorch
- Use cases and insights from the field
Slides from Talk @ Intel IoT DevFest IV
With both Facebook's and Google's recent shift in direction towards a "Future is Private" world, learn how you too can train and deploy your AI models in a privacy-preserving way, with Decentralized AI and a combination of AI and Blockchain. These techniques will become even more prevalent as we move into a world where users will own their own data and companies will start using “ethically sourced data” and move towards a path for Ethical AI for the IoT space.
In this session, you will learn:
- Use cases for Decentralized AI, with combined benefits of AI + Blockchain for IoT applications
- Federated learning & related privacy-preserving AI model training techniques for IoT applications
- How to build Ethical AI solutions for IoT using these techniques
Draper Accelerator talk slides - covering the convergence of AI and Blockchain and how it solves challenges for IoT, AI@Edge, Data Ethics, and User Data Monetization.
Decentralized AI: Convergence of AI + Blockchain (geetachauhan)
Santa Clara IoT Expo talk slides - covering the convergence of AI and Blockchain and how it solves challenges for IoT, AI@Edge, Data Ethics, and User Data Monetization.
Decentralized AI: Convergence of Blockchain + AI (geetachauhan)
This document discusses the convergence of blockchain and AI through decentralized AI approaches. It outlines challenges with centralized AI models regarding privacy, influence, economics and transparency. Decentralized solutions proposed include federated learning, blockchain, homomorphic encryption, and data marketplaces. Blockchain provides an open, trustless network to replace centralized authorities and enable applications like data exchanges, AI marketplaces and distributed machine learning across devices. Overall the goal is to democratize AI and data through user ownership and control.
Decentralized AI: Convergence of Blockchain + AI (geetachauhan)
As we move into a world where users will own their own data and companies will use "Ethically Sourced Data", there will be a growing need for Decentralized AI. Combined with Blockchain, one gets viable business models. This talk covers use cases for the convergence of Blockchain and AI.
Talk @ ACM SF Bay Area Chapter on Deep Learning for the medical imaging space.
The talk covers use cases, special challenges, and solutions for Deep Learning for Medical Image Analysis using TensorFlow+Keras. You will learn about:
- Use cases for Deep Learning in Medical Image Analysis
- Different DNN architectures used for Medical Image Analysis
- Special purpose compute / accelerators for Deep Learning (in the Cloud / On-prem)
- How to parallelize your models for faster training and serving for inference.
- Optimization techniques to get the best performance from your cluster (like Kubernetes/ Apache Mesos / Spark)
- How to build an efficient Data Pipeline for Medical Image Analysis using Deep Learning
- Resources to jump start your journey - like public data sets, common models used in Medical Image Analysis
The document discusses deep learning techniques for financial technology (FinTech) applications. It begins with examples of current deep learning uses in FinTech like trading algorithms, fraud detection, and personal finance assistants. It then covers topics like specialized compute hardware for deep learning training and inference, optimization techniques for CPUs and GPUs, and distributed training approaches. Finally, it discusses emerging areas like FPGA and quantum computing and provides resources for practitioners to start with deep learning for FinTech.
NIPS - Deep learning @ Edge using Intel's NCS (geetachauhan)
The document discusses using Intel's Neural Compute Stick for deep learning at the edge. It introduces the Neural Compute Stick, which enables computer vision and AI capabilities in small, low power devices. It then provides an overview of deep learning and discusses how to build IoT applications using the Neural Compute Stick SDK. Examples of use cases for edge intelligence in IoT are also presented.
Best Practices for On-Demand HPC in Enterprises (geetachauhan)
Traditionally HPC has been popular in scientific domains, but not in most other enterprises. With the advent of on-demand HPC in the cloud and the growing adoption of Deep Learning, HPC should now be a standard platform for any enterprise leading with AI and Machine Learning. This session will cover the best practices for building your own on-demand HPC cluster for enterprise workloads, along with key use cases where enterprises will benefit from an HPC solution.
Deep learning @ Edge using Intel's Neural Compute Stick (geetachauhan)
Talk @ Intel Global IoT DevFest, Nov 2017
The new generation of hardware accelerators is enabling rich, AI-driven, intelligent IoT solutions @ the edge.
The talk showcased how to use Intel's latest Neural Compute Stick for accelerating deep learning IoT solutions. It also covered use cases and code details for running Deep Learning models on Intel's Neural Compute Stick.
Distributed deep learning optimizations - AI WithTheBest (geetachauhan)
Learn how to optimize TensorFlow for your Intel CPU and techniques for distributed deep learning for both model training and inferencing. Talk @ AI WithTheBest
Distributed deep learning optimizations (geetachauhan)
The document discusses optimizations for distributed deep learning. It covers challenges like latency, cost and power consumption when scaling deep learning models. It then discusses specialized compute like Google TPUs and optimizations for CPU, GPU and inference workloads. Techniques like data parallelism, model parallelism, quantization and clustering are presented. Emerging areas like FPGA, neuromorphic and quantum computing are also mentioned.
Intel optimized tensorflow, distributed deep learning (geetachauhan)
This document discusses optimizations for running TensorFlow on Intel CPUs for deep learning. It outlines techniques for compiling TensorFlow from source with CPU optimizations, using proper data formats and batch sizes, and reading data with queues to leverage multi-core CPUs. It also covers distributed deep learning using TensorFlow Estimators, parameter servers, and model parallelism to distribute graphs across multiple machines. Resources for further information on Intel optimizations, installing libraries, and distributed TensorFlow are provided.
How Deep Learning will change IoT, taking us into a new era of AI-driven smart IoT devices with intelligence at the edge. The talk covers use cases and code details for running TensorFlow models on the Intel Edison and Raspberry Pi. Slides from the talk given at the Intel IoT With the Best 2017 conference.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3Data Hops
Free A4 downloadable and printable Cyber Security, Social Engineering Safety and security Training Posters . Promote security awareness in the home or workplace. Lock them Out From training providers datahops.com
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain: retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
JavaLand 2024: Application Development Green Masterplan
Profiling PyTorch for Efficiency & Sustainability
1. P R O F I L I N G P Y T O R C H F O R
E F F I C I E N C Y & S U S T A I N A B I L I T Y
N O V 1 7 , 2 0 2 1
G E E T A C H A U H A N
P Y T O R C H P A R T N E R E N G I N E E R I N G
M E T A A I
2. PyTorch Profiler Talk
A G E N D A
0 1  G P U P E R F O R M A N C E T U N I N G
0 2  P Y T O R C H P R O F I L E R
0 3  T I M E L I N E T R A C I N G
0 4  O P T I M I Z A T I O N E X A M P L E S
0 5  F U T U R E : S U S T A I N A B L E A I
4. PyTorch Profiler Talk
G P U P E R F O R M A N C E T U N I N G
A DIFFERENT MENTAL MODEL REQUIRED
CPU
• Optimized for single thread performance
  - Majority of chip area is control logic & caches
• Complex and deep out-of-order pipelines
  - Extract instruction level parallelism
• The brain
  - Job is to keep the accelerator busy
GPU
• Optimized for throughput of data-parallel problems
  - Majority of chip area is functional units
• Simple, relatively slow in-order pipelines
  - Achieves much higher total throughput
• Accelerator attached via PCIe
  - Order of magnitude faster but off to the side
5. PyTorch Profiler Talk
Composed of Streaming Multiprocessors (SMs)
Volta V100: 80 SMs
Ampere A100: 108 SMs
DGX A100 with 8 GPUs: 864 SMs vs 128 CPU cores
NVIDIA Volta V100 GPU
G P U P E R F O R M A N C E T U N I N G
6. PyTorch Profiler Talk
G P U P E R F O R M A N C E T U N I N G
Streaming Multiprocessor:
• 64x FP32 units
• 64x INT, 32x FP64, 32x LD/ST units
• 8x Tensor Cores
5120 FP32 EXECUTION UNITS PER GPU (6912 ON A100)
7. PyTorch Profiler Talk
• Excessive CPU/GPU interactions – e.g. a for loop launching GPU operations (see the sketch after this slide)
- Dominated by launch overheads
• Short GPU kernel durations – e.g. small inputs
- Need enough data to feed 10s of thousands of threads
• CPU overheads and I/O bottlenecks are starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
VISIBILITY IS KEY
G P U P E R F O R M A N C E T U N I N G
Common Pitfalls
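To make the first pitfall above concrete, here is a minimal sketch (not from the talk) contrasting a Python loop that launches many tiny GPU kernels with a single batched operation; the tensor sizes are illustrative.

import torch

xs = [torch.randn(128, 128, device="cuda") for _ in range(1000)]

# Anti-pattern: a Python for loop launching one small kernel per tensor,
# so CPU-side launch overhead dominates the actual GPU work.
out_slow = [x.sum() for x in xs]

# Better: batch the work into one large kernel so the GPU has enough data
# to keep its tens of thousands of threads busy.
out_fast = torch.stack(xs).sum(dim=(1, 2))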
8. PyTorch Profiler
W i t h I n t e g r a t e d G P U P r o f i l i n g L i b r a r y
9. CONTRIBUTED BY MICROSOFT & FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance recommendations
• Data Scientist friendly lifecycle and tools
• TensorBoard Plugin - chrome traces visualization
• OSS Kineto library - built on CUPTI
• Easy-to-use python API
• VS Code integration
Architecture diagram: the PyTorch process (Python / C++ / CUDA, aten operators) emits Python events and CPU operator traces to the PyTorch Profiler; GPU ops are queued and CUDA activities are collected via the OSS libkineto library on top of libCUPTI, the NVIDIA driver and the OS (GPU 1 … GPU n); the resulting traces are visualized by the TensorBoard Profiler plugin.
T H E P Y T O R C H P R O F I L E R
10. T H E P Y T O R C H P R O F I L E R
Profiling API: Base Usage
https://pytorch.org/tutorials/recipes/recipes/profiler.html
import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
11. T H E P Y T O R C H P R O F I L E R
Profiling API: TensorBoard Plugin
import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
13. T H E P Y T O R C H P R O F I L E R
Advanced
• When to trigger (the schedule argument)
• How many steps to profile
• Which activities to profile (CPU, CUDA)
• Results callable handler (on_trace_ready)
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard
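The options above map onto parameters of torch.profiler.profile. A minimal sketch (not from the talk) combining them, assuming a ResNet-18 workload and an illustrative 'results' output directory:

import torch
import torchvision.models as models
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

model = models.resnet18().cuda()
inputs = [torch.randn(32, 3, 224, 224).cuda() for _ in range(10)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],   # which activities
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),    # when and how many steps
    on_trace_ready=tensorboard_trace_handler("results"),        # results handler
    record_shapes=True,                                          # extra metadata
    profile_memory=True,
    with_stack=True,
) as prof:
    for batch in inputs:
        model(batch)
        prof.step()  # signal the profiler that one step is done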
18. PyTorch Profiler Talk
T I M E L I N E T R A C E S : C P U + G P U A C T I V I T I E S
19. PyTorch Profiler Talk
T I M E L I N E T R A C I N G
Chrome Trace Viewer: CPU and GPU timelines
20. PyTorch Profiler Talk
• Can leave in permanently, no perf overhead
T I M E L I N E T R A C I N G
21. PyTorch Profiler Talk
T I M E L I N E T R A C I N G
See how CPU and GPU ops are connected
22. PyTorch Profiler Talk
nvidia-smi shows 86% utilization, but only a fraction of the SMs are actually used by these kernels!
T I M E L I N E T R A C I N G
Inspect stats for individual activities
23. PyTorch Profiler Talk
Looks much better after increasing the input sizes.
T I M E L I N E T R A C I N G
Inspect stats for individual activities
24. Trace Analysis
E x a m p l e s f r o m M e t a w o r k l o a d s
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples.
25. PyTorch Profiler Talk
T R A C E A N A L Y S I S
Anti-pattern: Long GPU idle time
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations (see the sketch after this slide)
temp = ""
num_substr = len(emb[k])
with record_function("## join_string {} ##".format(num_substr)):
    temp = ",".join(str(x) for x in emb[k])  # string concatenation
with record_function("## append_record_in_else ##"):
    records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
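As a concrete illustration of point 3 above, here is a minimal sketch, not from the talk: CUDA kernel launches are asynchronous, so the host can prepare the next batch while the GPU is still computing. The model, tensor sizes, and prepare_next_batch helper are illustrative.

import torch

device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)

def prepare_next_batch():
    # CPU-side preprocessing; pinned memory enables async host-to-device copies
    return torch.randn(512, 1024).pin_memory()

batch = prepare_next_batch()
for _ in range(10):
    gpu_batch = batch.to(device, non_blocking=True)  # asynchronous copy
    out = model(gpu_batch)                           # kernel launches return immediately
    batch = prepare_next_batch()                     # CPU work overlaps with GPU compute
torch.cuda.synchronize()                             # wait for all GPU work at the end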
26. PyTorch Profiler Talk
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
First issue:
• Exponential moving average hook function has a loop – CPU bottleneck
• Can rewrite using torch._foreach ops – the loop now runs on the GPU
B E F O R E
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        it = model_state_iterator(task.base_model)
        # iterate on every name & param
        for name, param in it:
            s = self.state.ema_model_state
            s[name] = (self.decay * s[name]
                       + (1 - self.decay) * param.to(device=self.device))
A F T E R
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        torch._foreach_mul_(self.ema_model_state_list, self.decay)
        torch._foreach_add_(self.ema_model_state_list,
                            self.param_list,
                            alpha=(1 - self.decay))
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS
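For readers who want to try the pattern in isolation, here is a self-contained sketch of the same multi-tensor EMA update; the parameter shapes and decay value are illustrative, and note that torch._foreach_* are internal, underscore-prefixed ops.

import torch

decay = 0.999
params = [torch.randn(256, 256, device="cuda") for _ in range(100)]
ema_states = [p.clone() for p in params]

with torch.no_grad():
    # ema = decay * ema + (1 - decay) * param, applied to all tensors in one fused call
    torch._foreach_mul_(ema_states, decay)
    torch._foreach_add_(ema_states, params, alpha=1 - decay)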
27. PyTorch Profiler Talk
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version – using torch._foreach
• Switch to the optimized version!
B E F O R E
def prepare(self, param_groups):
    self.optimizer = RMSpropTFV2Optimizer(
        param_groups,
        …
A F T E R
import torch.optim._multi_tensor as optim_mt

def prepare(self, param_groups):
    self.optimizer = optim_mt.RMSprop(
        param_groups,
        …
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS
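As a side note, more recent PyTorch releases expose the multi-tensor path directly on the standard optimizers via a foreach flag, so the private _multi_tensor import is no longer needed. A minimal sketch (the model and hyperparameters are illustrative):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
# foreach=True selects the fused multi-tensor implementation of the update step
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, foreach=True)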
28. PyTorch Profiler Talk
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in the forward pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in the backward pass
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
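A minimal sketch (not from the talk) of how dominant operators like these can be surfaced without reading the full timeline: aggregate the profiler's statistics and sort by CUDA time. The convolution workload below is an illustrative stand-in for one training step.

import torch
import torch.profiler as profiler

model = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
data = torch.randn(8, 64, 56, 56, device="cuda")

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA]
) as prof:
    model(data).sum().backward()  # one illustrative forward + backward step

# Operators that dominate the iteration (e.g. SyncBatchNorm and the NCCL
# collectives it triggers in a distributed run) surface at the top of this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))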
29. PyTorch Profiler Talk
T R A C E A N A L Y S I S
BERT PERFORMANCE OPTIMIZATION CASE STUDY
• From 2.4 req/s to 1,400+ req/s
• CPU inference
• torch.set_num_threads(1)
• Intel IPEX
• Quantization
• GPU inference on 1 T4 GPU
• model.half()
• DistilBERT
• Increase batch size
• Do not overpad
• Faster Transformer
Configuration                                   Throughput     P99
BERT unoptimized, bs=1                          70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                        359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16                 689 seq/s      22.8 ms
BERT, Faster Transformer                        885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32     1423 seq/s     19.7 ms
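A minimal sketch (not from the talk) of two of the levers listed above: FP16 GPU inference via model.half() with a larger, minimally padded batch, and dynamic quantization plus torch.set_num_threads(1) for CPU inference. It assumes the Hugging Face transformers library; the DistilBERT checkpoint name is illustrative.

import copy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# GPU: half precision + a larger batch; padding=True pads only to the longest
# sequence in the batch instead of a fixed max length ("do not overpad").
texts = ["example sentence"] * 32
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
gpu_model = copy.deepcopy(model).half().cuda().eval()
with torch.no_grad():
    logits = gpu_model(**{k: v.cuda() for k, v in batch.items()}).logits

# CPU: one thread per worker process + dynamic quantization of the Linear layers
torch.set_num_threads(1)
cpu_model = torch.quantization.quantize_dynamic(
    model.eval(), {torch.nn.Linear}, dtype=torch.qint8
)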
32. PyTorch Profiler Talk
M O D E L D E P L O Y M E N T P H A S E S – P O W E R C O N S U M P T I O N
33. O P T I M I Z A T I O N S F O R C A R B O N F O O T P R I N T O F L M
• Platform level caching – 6.7x improvements
• GPU Acceleration – unlocks 10.1x energy efficiency
• Algorithmic Optimizations – 10x improvements
34. S U S T A I N A B I L I T Y M I N D S E T
1. Data Utilization Efficiency: Data Scaling & Sampling, Data perishability
2. Experimentation and Training Efficiency: NAS, HPO, Multi-Objective Optimizations, Resource Efficient Architectures
3. Efficient, Environmentally Scalable Infrastructure: Carbon efficient scheduling, On-device Learning, …
4. Develop easy to adopt Telemetry: Measure and publish, Carbon impact statement & model cards
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
35. PyTorch Profiler Talk
R E F E R E N C E S
• What’s new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VS Code TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler Talk – Profiling PyTorch Models for NVIDIA GPUs: https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch Performance: batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact