Balancing Power & Performance for Mobile Applications
Aravind Raghavan
Staff Engineer
Qualcomm Technologies, Inc.
2
Qualcomm® Snapdragon™ Heterogeneous Compute SDK
What is it?
• APIs to manage task execution across CPU, GPU and DSP
• Efficient data management between compute cores
• Provides abstraction from low-level system calls and data management
• Integrates with existing development environments
• C++11, OpenCL, OpenGL, Qualcomm® Hexagon™ SDK (DSP)
[Diagram: Userspace application; Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers); Snapdragon CPU, GPU, DSP]
Qualcomm Snapdragon and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries
3
Kernel
Computation to be executed on CPU/GPU/DSP
• Actual unit of work
• Wraps your existing algorithms
◦ CPU Kernel: C++ functors, lambdas, or function pointers
◦ GPU Kernel: OpenCL or OpenGL kernel
◦ DSP Kernel: written using the Hexagon SDK
• Attributes: Affinity, Blocking
• In Beta
◦ Poly Kernel: Write all, Run Somewhere
◦ Point Kernel: Write all, Run Everywhere
4
Kernel
Code Sample
Function doubles the values of an input vector
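A function of this kind could look like the following minimal sketch in plain C++11. The name vector_double appears in the slides; the signature, the int element type and the in-place update are assumptions made for illustration.

```cpp
#include <vector>

// Doubles every element of `values` in place.
// Assumption: the signature is illustrative; only the name vector_double
// appears in the slides.
void vector_double(std::vector<int>& values) {
    for (int& v : values) {
        v *= 2;
    }
}
```

Calling vector_double on {1, 2, 3} leaves the vector holding {2, 4, 6}.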
5
Kernel
Code Sample
Create a CPU kernel for vector_double
6
Affinity
Control placement of algorithm execution
• CPU core selection APIs
• Use APIs with
◦ Standalone functions
◦ Tasks, CPU Kernel abstractions
• Benefit: improve performance and save power
[Diagram: Userspace application; Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers); Snapdragon CPU, GPU, DSP]
7
Affinity
Control placement of algorithm execution
• Standalone functions: encapsulate your existing code with execute(settings, fn, fn_args)
• Use with Tasks/CPU Kernel: set_big(), set_little() on a Task or Kernel
• Location: Choose the CPU where the program construct should run
• Pinning: Determines whether the thread can migrate freely among cores
• Mode: Override or adhere to local affinity settings
8
Affinity
Code Sample
Tell the CPU Kernel to run on the big cluster
9
Patterns
Simplify Parallel Programming
• Commonly used parallel CPU programming constructs
◦ Data Parallelism
◦ Multi-branch recursion (divide & conquer)
◦ Pipeline computation
• Optimize parallel execution further using Pattern Tuners
• In Beta: Some patterns can execute across CPU, GPU, DSP
[Diagram: Userspace application; Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers); Data and Algorithm dispatched to each of the Snapdragon CPU, GPU, DSP]
10
Patterns
Parallelize commonly occurring algorithmic constructs
Pattern Name                       Description
hetcompute::pfor_each              Processes the elements of a collection in parallel
hetcompute::ptransform             Performs a map operation on all elements of a collection, returns a new collection
hetcompute::pscan                  Performs an in-place parallel prefix operation for all elements of a collection
hetcompute::preduce                Combines all the elements in a collection into one using an associative binary operator
hetcompute::pdivide_and_conquer    Divides problems into sub-problems, solves them, and merges their solutions in parallel
hetcompute::pipeline               A sequence of processing stages that can execute concurrently on a data stream
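To make the map, prefix and reduce operations concrete, here is a sketch of their sequential standard-library counterparts. This is an illustration of what each pattern computes, assuming int elements and addition as the operator. The function names are hypothetical, nothing here is a hetcompute:: signature, and nothing runs in parallel.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Sequential standard-library counterparts of three patterns from the table.
// Assumption: these only illustrate what each pattern computes; they are not
// the hetcompute:: signatures and they do not run in parallel.

// Map (cf. ptransform): returns a new collection with a function applied
// to each element.
std::vector<int> map_square(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [](int x) { return x * x; });
    return out;
}

// Prefix operation (cf. pscan): in-place inclusive prefix sum.
void prefix_sum_in_place(std::vector<int>& v) {
    std::partial_sum(v.begin(), v.end(), v.begin());
}

// Reduction (cf. preduce): combines all elements with an associative operator.
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}
```

The requirement that the reduction operator be associative is what allows a parallel implementation to combine partial results in any grouping.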
11
Programmer Hints: Pattern Tuner
Customize parallel algorithm execution for finer optimizations
Pattern Tuner API                                  Description
set_chunk_size(size_t sz)                          Smallest granularity for load balancing. If the computational kernel is small, set a large chunk size to minimize the synchronization overhead.
set_max_doc(size_t doc)                            Max degree of concurrency; the default is the number of available device threads
set_static()                                       Use a static chunking algorithm as the parallelization backend
set_dynamic()                                      Use a dynamic workload balancing algorithm as the parallelization backend
set_shape(pattern::shape)                          Set the shape of the workload distribution across the range of work-items
set_cpu_load(), set_gpu_load(), set_dsp_load()     Set the fraction of the workload to schedule on CPU, GPU, DSP
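The effect of the chunk size can be illustrated with a small sketch that splits an index range into chunks. The helper make_chunks is hypothetical and only mirrors the idea behind set_chunk_size(); how the SDK actually schedules chunks is not shown in the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits the index range [0, n) into half-open chunks of at most chunk_size
// items. Assumption: this mirrors the idea behind set_chunk_size(); the SDK's
// internal scheduling is not shown in the slides.
std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t n, std::size_t chunk_size) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (chunk_size == 0) {
        return chunks;  // no valid chunking for a zero chunk size
    }
    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        chunks.emplace_back(begin, std::min(begin + chunk_size, n));
    }
    return chunks;
}
```

A larger chunk size yields fewer chunks and therefore fewer synchronization points, at the cost of coarser load balancing.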
12
Patterns
Code Sample
Parallelize vector_double across all CPUs
13
Tasks
Fundamental unit of asynchrony
• Independent units of work that can be executed asynchronously on the CPU, GPU, or DSP
• Computation bound with data
◦ Control: C++ lambdas and functions, Kernels, Patterns, …
◦ Data: Buffers, function arguments, …
• Easy task management
• Groups bundle sets of related tasks
[Diagram: Userspace application; Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers); CPU Task, GPU Task and DSP Task, each with Control and Data; Snapdragon CPU, GPU, DSP]
14
Tasks
Fundamental unit of asynchrony
Task APIs                   Description
hetcompute::create_task     Creates a Heterogeneous Compute task
t1->then(t2)                Control dependency from t1 to t2
t2->bind_all(t1)            Data dependency from t1 to t2
t->launch()                 Launches a task into the Heterogeneous Compute Runtime
t->wait_for()               Waits for the task to complete (blocking call)
t->cancel()                 Cancels a launched task. Should be used with hetcompute::abort_on_cancel() to cancel a running task.
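The notion of a control dependency can be sketched with a toy, single-threaded task graph. Task, then() and launch_all() below are hypothetical names that only mimic the idea behind t1->then(t2); they are not the hetcompute API, and a real runtime would execute ready tasks asynchronously.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// A toy, single-threaded task graph. Assumption: Task, then() and launch_all()
// are hypothetical names that only mimic the control-dependency idea of
// t1->then(t2); they are not the hetcompute API.
struct Task {
    std::function<void()> work;
    std::vector<Task*> successors;  // tasks that must run after this one
    std::size_t pending = 0;        // number of unfinished predecessors

    explicit Task(std::function<void()> w) : work(std::move(w)) {}

    // Control dependency: `next` may start only after this task finishes.
    void then(Task& next) {
        successors.push_back(&next);
        ++next.pending;
    }
};

// Runs every task once, respecting the dependencies (Kahn's algorithm).
void launch_all(std::vector<Task*> tasks) {
    std::vector<Task*> ready;
    for (Task* t : tasks) {
        if (t->pending == 0) ready.push_back(t);
    }
    while (!ready.empty()) {
        Task* t = ready.back();
        ready.pop_back();
        t->work();
        for (Task* s : t->successors) {
            if (--s->pending == 0) ready.push_back(s);
        }
    }
}
```

Even if t2 is handed to launch_all before t1, a prior t1.then(t2) ensures that t1's work runs first.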
15
Tasks
Code Sample
Create a task with a CPU Kernel
16
Tasks
Code Sample
Launch the task with the HetCompute Runtime
17
Tasks
Code Sample
Wait for Task completion
18
Buffers
Heterogeneous Memory Management
• Managed array-like data store for user-defined data types
• Abstracts specialized memory for OpenGL, OpenCL, Textures, ION
• Accessible by CPU, GPU, DSP Tasks and the Host application
• APIs move and synchronize data across compute cores efficiently
[Diagram: Userspace application with host buffer access; Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers); CPU Task, GPU Task and DSP Task, each with Control, sharing Buffers; Snapdragon CPU, GPU, DSP]
19
Buffers
Key APIs
Buffer APIs                     Description
hetcompute::create_buffer<T>    Creates a Heterogeneous Compute buffer. Supports different variants: preallocated memory, ION/GL/CL memory
hetcompute::buffer_ptr<T>       Smart pointer to a managed buffer; has a std::array-like interface
acquire_ro()                    Acquire the buffer with read-only access. To be used by application host code
acquire_wi()                    Acquire the buffer with write-invalidate access. If successful, the previous contents of the buffer are lost. To be used by application host code
acquire_rw()                    Acquire the buffer with write access. To be used by application host code
release()                       Releases the acquired buffer
20
Buffers
Code Sample
Create a buffer of int
21
Buffers
Code Sample
Acquire the buffer in the application and fill it with data
22
Buffers
Code Sample
Create a CPU kernel task and bind the buffer data
23
Buffers
Code Sample
Launch the Task. A Task always has read-write access over the buffer
24
Buffers
Heterogeneous Memory Management
Memory Accessibility          CPU    GPU    DSP
Host Memory                   Yes    No     No
OpenCL/GL host-accessible     Yes    Yes    No
ION                           Yes    Yes    Yes
Using ION memory as the backing store can improve performance (avoids a copy)
[Diagram: Host Memory, OpenCL/GL host-accessible memory and ION memory shown against the big CPU, LITTLE CPU, GPU and DSP]
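The accessibility rules on this slide can be written down as a small lookup, sketched below. The enum and function names are illustrative assumptions and not SDK types.

```cpp
enum class Memory { Host, ClGlHostAccessible, Ion };
enum class Device { Cpu, Gpu, Dsp };

// Encodes the memory accessibility table from the slide.
// Assumption: the enum and function names are illustrative, not SDK types.
bool is_accessible(Memory memory, Device device) {
    switch (memory) {
        case Memory::Host:               return device == Device::Cpu;
        case Memory::ClGlHostAccessible: return device != Device::Dsp;
        case Memory::Ion:                return true;
    }
    return false;  // unreachable for valid enum values
}
```

ION is the only memory type in the table that all three devices can reach, which is why using it as the backing store can avoid a copy.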
25
Power Optimization SDK
26
Approaches to Power Management
Reactive vs Proactive
Reactive Model
• Standard system power management
• Acceptable for many use cases
• Generic solution, leaving opportunity for power optimization with some algorithms
[Diagram: Application, Workload, System, Power/Thermal Adjustment]
Proactive Model
• Developer-driven power management
• Control power consumption during algorithm execution
• Developer understanding of the algorithm and system can lead to additional power optimization opportunities
[Diagram: Application, Workload, Direct Recommendation, System, Power/Thermal Adjustment]
27
Power Optimization SDK
Run-time power and performance control for CPU and GPU
• APIs to provide granular control of core frequencies
• Developers request power control for their algorithm
• Requests are subject to system constraints
◦ Does not override the system
◦ Interfaces with Perflock
• Static and Dynamic power management APIs for CPU and GPU
[Diagram: Userspace application; Power Optimization SDK (Static, Dynamic); Perflock; Snapdragon CPU, GPU, DSP]
28
Power Optimization SDK
Static APIs
• One API call to control CPU and GPU clock frequency
• Choose one of 5 predefined modes
• Define the duration the mode should be active
• Target the device (big CPU, LITTLE CPU, GPU)
[Diagram: Userspace application; Power Optimization SDK (Static, Dynamic); Perflock; Snapdragon CPU, GPU, DSP]
29
Using the Power Optimization SDK
Static APIs
Power Mode           Description
Normal               Default system state
Efficient            Close to best performance with power savings
Performance Burst    All cores at max frequency for a short duration
Saver                Half of peak performance
Window               Set minimum and maximum frequency window
30
Power Optimization SDK – Static API
Code Samples
Set the big cluster to operate between 50% and 60% of the max frequency index
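One way to read a window expressed as a percentage of the max frequency index is sketched below. The helper index_for_percent, the idea of a frequency table with a fixed number of steps, and the round-down rule are all assumptions; the slides only state that the window is given as a percentage of the max frequency index.

```cpp
#include <cstddef>

// Maps a percentage of the maximum frequency index to an index into a
// frequency table with table_size entries sorted in ascending order.
// Assumptions: the helper and its round-down rule are hypothetical; the
// slides only state that the window is a percentage of the max frequency index.
std::size_t index_for_percent(std::size_t table_size, unsigned percent) {
    if (table_size == 0) return 0;
    if (percent > 100) percent = 100;
    return (table_size - 1) * percent / 100;
}
```

Under these assumptions, a table with 11 frequency steps (indices 0 to 10) and a 50-60% window would select indices 5 through 6.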
31
Power Optimization SDK
Dynamic APIs (Experimental)
• Self-regulates performance while trying to minimize energy consumption
• Real-time applications: games, streaming, video
• Currently supported only on the big cluster
[Diagram: Userspace application; Power Optimization SDK (Static, Dynamic); Perflock; Snapdragon CPU, GPU, DSP]
32
Using the Power Optimization SDK
Dynamic APIs (Experimental)
Dynamic API     Description
set_goal()      Starts the automatic performance/power regulation mode
regulate()      Application feedback to the SDK; the SDK uses it to self-regulate the system so that it reaches the performance goal while saving power
clear_goal()    Terminates the regulation process
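The set_goal() / regulate() / clear_goal() life cycle can be pictured as a simple feedback loop. The class below is a toy model: its name, its fields and its one-step-up, one-step-down rule are hypothetical, since the slides do not describe the SDK's actual control policy.

```cpp
#include <cstddef>

// A toy feedback regulator. Assumptions: the class, its fields and the
// step-up/step-down rule are hypothetical and only illustrate the
// set_goal() / regulate() / clear_goal() life cycle; the SDK's actual
// control policy is not described in the slides.
class ToyRegulator {
public:
    explicit ToyRegulator(std::size_t max_level) : max_level_(max_level) {}

    // Start regulating towards `goal` units of work per time slice.
    void set_goal(double goal) { goal_ = goal; active_ = true; }

    // Feed back the measured throughput; adjust the frequency level by one step.
    void regulate(double measured) {
        if (!active_) return;
        if (measured < goal_ && level_ < max_level_) {
            ++level_;  // too slow: raise the frequency level
        } else if (measured > goal_ && level_ > 0) {
            --level_;  // faster than needed: lower the level to save power
        }
    }

    // Stop regulating and return to the default level.
    void clear_goal() { active_ = false; level_ = 0; }

    std::size_t level() const { return level_; }

private:
    std::size_t max_level_;
    std::size_t level_ = 0;
    double goal_ = 0.0;
    bool active_ = false;
};
```

The application would call regulate() periodically with its measured throughput, for example the number of elements processed per millisecond.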
33
Power Optimization SDK – Dynamic API
Code Samples (Experimental)
Goal: the number of elements the application wants to process per millisecond
34
Power Optimization SDK – Dynamic API
Code Samples (Experimental)
Application processing: track the number of elements processed per millisecond
35
Power Optimization SDK – Dynamic API
Code Samples (Experimental)
Allow the API to make adjustments based on the number of elements being processed per millisecond
36
Power Optimization SDK – Dynamic API
Code Samples (Experimental)
Put the system back to the Normal state
37
Power Improvement Case Study
Using the Heterogeneous Compute SDK and Power Optimization SDK to Improve Power
38
Lowering Power consumption
Heterogeneous execution is key for managing power/thermal
Using more cores and lowering their frequency can deliver the same performance while consuming less power
39
Case Study: Find all primes under 10 million
Sequential variant
is_prime already has some optimizations, such as skipping even numbers and testing divisors only up to sqrt(n)
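A sequential variant with those two optimizations might look like the sketch below. It is a reconstruction of the idea rather than the code shown in the talk; the function name is_prime comes from the slide, while count_primes and the integer types are assumptions.

```cpp
#include <cstdint>

// Trial-division primality test with the two optimizations named on the
// slide: even numbers are skipped and divisors are tried only up to sqrt(n).
// Assumption: this is a reconstruction of the idea, not the slide's code.
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    // d * d <= n is equivalent to d <= sqrt(n); 64-bit d avoids overflow.
    for (std::uint64_t d = 3; d * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Sequential variant: counts the primes in [0, limit).
std::uint32_t count_primes(std::uint32_t limit) {
    std::uint32_t count = 0;
    for (std::uint32_t n = 0; n < limit; ++n) {
        if (is_prime(n)) ++count;
    }
    return count;
}
```

Each call to is_prime is independent of the others, which is what makes the outer loop in count_primes a natural candidate for parallelization.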
41
Using Profiler
Case Study: Find all primes under 10 million
# of Cores
CPU Utilization
CPU Frequency
Processing Time
CPU Power
42
Sequential variant
Case Study: Find all primes under 10 million
# of Cores         1
CPU Utilization    100%
CPU Frequency      Max (1.9 GHz)
Processing Time    34 seconds
CPU Power          125 mW
47
Case Study: Find all primes under 10 million
What can we parallelize?
The iterative loop: pfor_each can be used to parallelize it
49
Case Study: Find all primes under 10 million
Parallel variant
Simple parallel version; the work distribution between big/LITTLE cores and the chunk size could still be improved
50
Parallel variant
Case Study: Find all primes under 10 million
# of Cores         8
CPU Utilization    100%
CPU Frequency      Max (1.90/2.36 GHz)
Processing Time    6.2 seconds
CPU Power          281 mW
55
Compare Sequential and Parallel variant
Case Study: Find all primes under 10 million
                   Sequential        Parallel
# of Cores         1                 8
CPU Utilization    100%              100%
CPU Frequency      Max (1.90 GHz)    Max (1.90/2.36 GHz)
Processing Time    34 sec            6.2 sec (82%)
CPU Power          125 mW            281 mW (55%)
56
Case Study: Find all primes under 10 million
Parallel variant
Can we use the Power SDK to fine-tune power consumption?
57
Case Study: Find all primes under 10 million
Parallel variant with Power Tuning
Goal: Max Power Savings
Request that the big and LITTLE clusters run at 0-15% of max frequency
58
Parallel variant with Power Tuning
Case Study: Find all primes under 10 million
# of Cores         8
CPU Utilization    100%
CPU Frequency      Min (512/652 MHz)
Processing Time    26 seconds
CPU Power          82 mW
63
Comparison chart - Recap
Case Study: Find all primes under 10 million
                   Sequential        Parallel               Parallel with Power SDK (min freq)
# of Cores         1                 8                      8
CPU Utilization    100%              100%                   100%
CPU Frequency      Max (1.90 GHz)    Max (1.90/2.36 GHz)    Min (512/652 MHz)
Processing Time    34 sec            6.2 sec (82%)          26 sec (23%)
CPU Power          125 mW            281 mW (55%)           82 mW (34%)
Choose the optimal power-performance
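The figures in the comparison can be cross-checked with a little arithmetic. The sketch below assumes that the reported CPU power is an average over the whole run, so that energy is power multiplied by time; the helper names are hypothetical. Under that assumption the three runs come to roughly 4.25 J (125 mW over 34 s), 1.74 J (281 mW over 6.2 s) and 2.13 J (82 mW over 26 s).

```cpp
// Energy in joules from average power (milliwatts) and run time (seconds).
// Assumption: the slide's CPU power is treated as an average over the run.
double energy_joules(double power_mw, double seconds) {
    return power_mw / 1000.0 * seconds;
}

// Percentage by which `value` is lower than `baseline`.
double percent_lower(double baseline, double value) {
    return (baseline - value) / baseline * 100.0;
}
```

With these helpers, percent_lower(34, 6.2) is about 81.8 and percent_lower(125, 82) is about 34.4, in line with the 82% and 34% annotations. The 55% annotation appears to correspond to percent_lower(281, 125), which is about 55.5, that is, it seems to be measured relative to the parallel variant's power.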
66
Lowering Power consumption
Strategy for power savings
• Using more cores and lowering their frequency allows us to get the same performance with lower energy
• Choosing the right compute device is the key to lowering power
◦ Big/LITTLE/GPU/DSP
• Strategy to reduce power while maintaining performance
◦ Extract parallelism
◦ Control placement of algorithm execution onto the right device
◦ Power tuning using the Power SDK
For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Thank you!
Nothing in these materials is an offer to sell any of the
components or devices referenced herein.
©2018 Qualcomm Technologies, Inc. and/or its affiliated
companies. All Rights Reserved.
Qualcomm, Snapdragon and Hexagon are trademarks of
Qualcomm Incorporated, registered in the United States
and other countries. Other products and brand names may
be trademarks or registered trademarks of their respective
owners.
References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as
applicable. Qualcomm Incorporated includes Qualcomm’s licensing
business, QTL, and the vast majority of its patent portfolio. Qualcomm
Technologies, Inc., a wholly-owned subsidiary of Qualcomm
Incorporated, operates, along with its subsidiaries, substantially all of
Qualcomm’s engineering, research and development functions, and
substantially all of its product and services businesses, including its
semiconductor business, QCT.
