Balancing Power & Performance Webinar

Balancing Power & Performance for Mobile Applications
Aravind Raghavan
Staff Engineer
Qualcomm Technologies, Inc.

2
Qualcomm® Snapdragon™ Heterogeneous Compute SDK
What is it?
• APIs to manage task execution across CPU, GPU and DSP
• Efficient data management between compute cores
• Provides abstraction from low-level system calls and data management
• Integrates with existing development environments
• C++11, OpenCL, OpenGL, Qualcomm® Hexagon™ SDK (DSP)
[Diagram: userspace application on top of the Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers), running on the Snapdragon CPU, GPU and DSP]
Qualcomm Snapdragon and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries

3
Kernel
Computation to be executed on CPU/GPU/DSP
• Actual unit of work
• CPU Kernel: C++ functors, lambdas, or function pointers
◦ Attributes: Affinity, Blocking
• GPU Kernel: OpenCL or OpenGL kernel
• DSP Kernel: written using Hexagon SDK
• In Beta
◦ Poly Kernel: Write all, Run Somewhere
◦ Point Kernel: Write once, Run Everywhere
[Diagram: Your Existing Algorithms become a Kernel (CPU Kernel, GPU Kernel or DSP Kernel)]

4
Kernel
Code Sample
Function doubles values of an input vector

5
Kernel
Code Sample
Create a CPU kernel for vector_double
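The code from these two sample slides was not captured in the text. As a minimal sketch in standard C++, the function below doubles an input vector in place and is then wrapped in a callable object. The `cpu_kernel` alias and `make_cpu_kernel` helper are hypothetical stand-ins for illustration only; they are not the SDK's kernel type or creation call.

```cpp
#include <functional>
#include <vector>

// Doubles every value of the input vector in place.
void vector_double(std::vector<int>& values) {
    for (int& v : values) {
        v *= 2;
    }
}

// Hypothetical stand-in for a CPU kernel: a callable wrapping an existing
// function. The SDK's own kernel type is not reproduced here.
using cpu_kernel = std::function<void(std::vector<int>&)>;

// Wraps vector_double so it can be handed to a scheduler as a unit of work.
cpu_kernel make_cpu_kernel() {
    return cpu_kernel(vector_double);
}
```

Calling the returned object on a vector holding 1, 2, 3 leaves it holding 2, 4, 6.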

6
Affinity
Control placement of algorithm execution
• CPU core selection APIs
• Use APIs with
◦ Standalone functions
◦ Tasks, CPU Kernel abstractions
• Benefit: improve performance and save power
[Diagram: userspace application, SDK components (Patterns, Affinity, Tasks, Buffers), Snapdragon CPU, GPU and DSP]

7
Affinity
Control placement of algorithm execution
• Standalone functions: encapsulate your existing code
◦ execute(settings, fn, fn_args)
• Use with Tasks/CPU Kernel
◦ set_big()
◦ set_little()
• Location: choose the CPU where the program construct should run
• Pinning: determines whether the thread can migrate freely among cores
• Mode: override or adhere to local affinity settings

8
Affinity
Code Sample
Tell CPU Kernel to run in big cluster

9
Patterns
Simplify Parallel Programming
• Commonly used parallel CPU programming constructs
◦ Data parallelism
◦ Multi-branch recursion (divide & conquer)
◦ Pipeline computation
• Optimize parallel execution further using Pattern Tuners
• In Beta: some patterns can execute across CPU, GPU, DSP
[Diagram: userspace application whose data and algorithm are distributed across the Snapdragon CPU, GPU and DSP]

10
Patterns
Parallelize commonly occurring algorithmic constructs
Pattern Name | Description
hetcompute::pfor_each | Processes the elements of a collection in parallel
hetcompute::ptransform | Performs a map operation on all elements of a collection, returns a new collection
hetcompute::pscan | Performs an in-place parallel prefix operation for all elements of a collection
hetcompute::preduce | Combines all the elements in a collection into one using an associative binary operator
hetcompute::pdivide_and_conquer | Divides problems into sub-problems, solves them, and merges their solutions in parallel
hetcompute::pipeline | A sequence of processing stages that can execute concurrently on a data stream
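To make the map, prefix and reduce descriptions concrete, here is a sequential sketch of what those three operations compute, written with standard C++ algorithms. It shows only the results, not the parallel execution, and it does not use the SDK. The function names are illustrative, and the choice of an inclusive prefix sum is an assumption.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Map (what ptransform is described as doing): produce a new collection.
std::vector<int> map_double(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [](int x) { return 2 * x; });
    return out;
}

// Prefix operation (what pscan is described as doing): in-place running sum.
// An inclusive scan is assumed here.
void prefix_sum_in_place(std::vector<int>& v) {
    std::partial_sum(v.begin(), v.end(), v.begin());
}

// Reduction (what preduce is described as doing): fold with an associative
// binary operator, here addition.
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}
```

Associativity matters for the reduction because a parallel implementation is free to combine partial results in a different grouping than a left-to-right loop.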

11
Programmer Hints: Pattern Tuner
Customize parallel algorithm execution for finer optimizations
Pattern Tuner API | Description
set_chunk_size(size_t sz) | Smallest granularity for load balancing. If the computational kernel is small, set a large chunk size to minimize the synchronization overhead.
set_max_doc(size_t doc) | Maximum degree of concurrency; the default is the number of available device threads
set_static() | Use a static chunking algorithm as the parallelization backend
set_dynamic() | Use a dynamic workload balancing algorithm as the parallelization backend
set_shape(pattern::shape) | Set the shape of the workload distribution across the range of work items
set_cpu_load(), set_gpu_load(), set_dsp_load() | Set the fraction of the workload to schedule on CPU, GPU, DSP
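The effect of a chunk size and a maximum degree of concurrency can be illustrated with a small dynamic scheduler built on `std::thread`. This is a sketch of the general technique, not the SDK's implementation. `dynamic_parallel_for` and its parameters are hypothetical names chosen to mirror the tuner descriptions.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, count). Workers repeatedly claim the next
// chunk of indices from a shared atomic counter (dynamic load balancing).
// chunk_size is the smallest unit of work handed out; max_doc caps the
// number of threads working at once, including the calling thread.
void dynamic_parallel_for(std::size_t count, std::size_t chunk_size,
                          std::size_t max_doc,
                          const std::function<void(std::size_t)>& body) {
    if (chunk_size == 0) chunk_size = 1;
    if (max_doc == 0) max_doc = 1;
    std::atomic<std::size_t> next{0};
    auto worker = [&]() {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk_size);
            if (begin >= count) return;
            const std::size_t end = begin + std::min(chunk_size, count - begin);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t + 1 < max_doc; ++t) threads.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& th : threads) th.join();
}
```

Each claim of a chunk costs one atomic operation, so a tiny loop body with a chunk size of 1 spends most of its time synchronizing. That is the overhead a larger chunk size is meant to reduce.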

12
Patterns
Code Sample
Parallelize vector_double across all CPUs
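The sample code for this slide was not captured. As a rough standard-C++ analogue of running the doubling loop across all CPU cores, the sketch below splits the vector into one contiguous range per hardware thread (static chunking). It uses `std::thread` rather than the SDK's pattern call. `parallel_vector_double` is an illustrative name.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Doubles every element, giving each hardware thread one contiguous range.
void parallel_vector_double(std::vector<int>& values) {
    std::size_t workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 1;  // the call may report 0 when unknown
    workers = std::min(workers, std::max<std::size_t>(values.size(), 1));
    const std::size_t chunk = (values.size() + workers - 1) / workers;

    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, values.size());
        const std::size_t end = std::min(begin + chunk, values.size());
        if (begin == end) continue;  // nothing left for this worker
        threads.emplace_back([&values, begin, end]() {
            for (std::size_t i = begin; i < end; ++i) values[i] *= 2;
        });
    }
    for (auto& t : threads) t.join();
}
```

The ranges do not overlap, so the workers need no locking while they write.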

13
Tasks
Fundamental unit of asynchrony
• Independent units of work that can be executed asynchronously on CPU, GPU, DSP
• Computation bound with data
◦ Control: C++ lambdas & functions, Kernels, Patterns, …
◦ Data: Buffers, function arguments, …
• Easy task management
• Groups bundle sets of related tasks
[Diagram: userspace application with a CPU Task, a GPU Task and a DSP Task, each holding Control and Data, running on the Snapdragon CPU, GPU and DSP]

14
Tasks
Fundamental unit of asynchrony
Task API | Description
hetcompute::create_task | Creates a Heterogeneous Compute task
t1->then(t2) | Control dependency from t1 to t2
t2->bind_all(t1) | Data dependency from t1 to t2
t->launch() | Launches a task into the Heterogeneous Compute runtime
t->wait_for() | Waits for the task to complete (blocking call)
t->cancel() | Cancels a launched task. Should be used with hetcompute::abort_on_cancel() to cancel a running task

15
Tasks
Code Sample
Create a task with CPU Kernel

16
Tasks
Code Sample
Launch task with HetCompute Runtime

17
Tasks
Code Sample
Wait for Task completion
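The create, launch and wait steps on the preceding slides can be loosely mirrored with `std::async` and `std::future` from the C++ standard library. This is only an analogue and not the SDK's task API. Starting work with `std::async` roughly corresponds to creating and launching a task, `get()` to the blocking wait, and consuming one task's result in another to a data dependency. `run_two_dependent_tasks` is a hypothetical name.

```cpp
#include <future>
#include <vector>

// Runs two asynchronous steps where the second consumes the first's output.
std::vector<int> run_two_dependent_tasks(std::vector<int> input) {
    // First task: double every element, starting asynchronously right away.
    std::future<std::vector<int>> t1 =
        std::async(std::launch::async, [input]() mutable {
            for (int& v : input) v *= 2;
            return input;
        });

    // Second task: depends on the data produced by the first task.
    std::future<std::vector<int>> t2 =
        std::async(std::launch::async, [&t1]() {
            std::vector<int> data = t1.get();  // blocks until t1 is done
            for (int& v : data) v += 1;
            return data;
        });

    // Blocking wait for the final result, similar in spirit to wait_for().
    return t2.get();
}
```

For an input of 1, 2, 3 the first step yields 2, 4, 6 and the second yields 3, 5, 7.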

18
Buffers
Heterogeneous Memory Management
• Managed array-like data store for user-defined data types
• Abstracts specialized memory for OpenGL, OpenCL, Textures, ION
• Accessible by CPU, GPU, DSP Tasks and the host application
• APIs move and synchronize data across compute cores efficiently
[Diagram: userspace application with CPU, GPU and DSP Tasks sharing Buffers, plus host buffer access, on the Snapdragon CPU, GPU and DSP]

19
Buffers
Key APIs
Buffer API | Description
hetcompute::create_buffer<T> | Creates a Heterogeneous Compute buffer. Supports different variants: preallocated memory, ION/GL/CL memory
hetcompute::buffer_ptr<T> | Smart pointer to a managed buffer; has a std::array-like interface
acquire_ro() | Acquire the buffer with read-only access. To be used by application host code
acquire_wi() | Acquire the buffer with write-invalidate access. If successful, the previous contents of the buffer are lost. To be used by application host code
acquire_rw() | Acquire the buffer with read-write access. To be used by application host code
release() | Releases the acquired buffer

20
Buffers
Code Sample
Create a buffer of int

21
Buffers
Code Sample
Acquire buffer in application and fill data

22
Buffers
Code Sample
Create a CPU kernel task and bind buffer data

23
Buffers
Code Sample
Launch Task. Task always has read-write access over buffer

24
Buffers
Heterogeneous Memory Management
Memory | CPU | GPU | DSP
Host Memory | Yes | No | No
OpenCL/GL host-accessible | Yes | Yes | No
ION | Yes | Yes | Yes
Using ION memory as the backing store can improve performance (avoids copies)
[Diagram: Host Memory, OpenCL/GL host-accessible memory and ION memory, and their accessibility from the big CPU, LITTLE CPU, GPU and DSP]

26
Approaches to Power Management
Reactive vs Proactive
Reactive Model
• Standard system power management
• Acceptable for many use cases
• Generic solution, leaving opportunity for power optimization with some algorithms
[Diagram: Application sends Workload to System; System makes Power/Thermal Adjustment]
Proactive Model
• Developer-driven power management
• Control power consumption during algorithm execution
• Developer understanding of algorithm and system can lead to additional power optimization opportunities
[Diagram: Application sends Workload and a Direct Recommendation to System; System makes Power/Thermal Adjustment]

27
Power Optimization SDK
Run-time power and performance control for CPU and GPU
• APIs to provide granular control of core frequencies
• Developers request power control for their algorithm
• Requests are subject to system constraints
◦ Does not override the system
◦ Interfaces with Perflock
• Static and Dynamic power management APIs for CPU and GPU
[Diagram: userspace application using the Static and Dynamic APIs, which interface with Perflock on the Snapdragon CPU, GPU and DSP]

28
Static APIs
• One API call to control CPU and GPU clock frequency
• Choose one of 5 predefined modes
• Define the duration the mode should be active
• Target the device (big CPU, LITTLE CPU, GPU)
[Diagram: userspace application using the Static and Dynamic APIs, which interface with Perflock on the Snapdragon CPU, GPU and DSP]

29
Using the Power Optimization SDK
Static APIs
Power Mode | Description
Normal | Default system state
Efficient | Close to best performance with power savings
Performance Burst | All cores at max frequency for a short duration
Saver | Half of peak performance
Window | Set minimum and maximum frequency window

30
Power Optimization SDK – Static API
Code Samples
Set big cluster to operate between 50-60% of max frequency index
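The sample code for this slide was not captured. One plausible reading of a window expressed as a percentage of the maximum frequency index is that each percentage selects a position in the cluster's table of supported frequencies. The sketch below illustrates that reading only. `percent_to_index`, `frequency_window` and the frequency table are hypothetical and are not SDK calls or real device values.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Maps a percentage (0-100) of the maximum frequency index to an index into
// a table of supported frequencies, rounding down. Hypothetical helper.
std::size_t percent_to_index(std::size_t table_size, unsigned percent) {
    assert(table_size > 0 && percent <= 100);
    return (table_size - 1) * percent / 100;
}

// Returns the {minimum, maximum} frequency of the requested window, in the
// same unit as the table entries. The table is assumed sorted ascending.
std::pair<unsigned, unsigned> frequency_window(
        const std::vector<unsigned>& table,
        unsigned min_percent, unsigned max_percent) {
    assert(min_percent <= max_percent);
    return {table[percent_to_index(table.size(), min_percent)],
            table[percent_to_index(table.size(), max_percent)]};
}
```

With an 11-entry table the maximum index is 10, so a 50-60% window selects the entries at index 5 and index 6.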

31
Dynamic APIs (Experimental)
• Self-regulates performance while trying to minimize energy consumption
• Real-time applications: games, streaming, video
• Currently supported only on the big cluster
[Diagram: userspace application using the Static and Dynamic APIs, which interface with Perflock on the Snapdragon CPU, GPU and DSP]

32
Using the Power Optimization SDK
Dynamic APIs (Experimental)
Dynamic API | Description
set_goal() | Start the automatic performance/power regulation mode
regulate() | Application feedback to the SDK, which the SDK uses to self-regulate the system so that it meets the performance goal while saving power
clear_goal() | Terminate the regulation process

33
Power Optimization SDK – Dynamic API
Code Samples (Experimental)
Goal: # of elements application wants to process in a millisecond

34
Application processing
Track # of elements processed per millisecond

35
Allow API to make adjustments based on number of elements being processed per millisecond

36
Put the system back to Normal state
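The four steps above (set a goal, process while tracking throughput, feed the measurement back, restore the normal state) form a feedback loop. The toy model below sketches that loop in standard C++ under simplifying assumptions. `toy_regulator`, its discrete frequency levels and its 10% headroom are hypothetical and do not represent the SDK's behavior.

```cpp
#include <cassert>
#include <cstddef>

// Toy feedback regulator: steps a frequency level down while measured
// throughput exceeds the goal with headroom, and up when it falls short.
class toy_regulator {
public:
    explicit toy_regulator(std::size_t num_levels)
        : max_level_(num_levels - 1), level_(max_level_) {
        assert(num_levels > 0);
    }

    // Start regulation towards a goal in elements per millisecond.
    void set_goal(double elements_per_ms) {
        goal_ = elements_per_ms;
        active_ = true;
    }

    // Application feedback: throughput measured over the last interval.
    void regulate(double measured_elements_per_ms) {
        if (!active_) return;
        if (measured_elements_per_ms < goal_ && level_ < max_level_) {
            ++level_;  // behind the goal: speed up
        } else if (measured_elements_per_ms > goal_ * 1.1 && level_ > 0) {
            --level_;  // comfortably ahead: slow down to save power
        }
    }

    // Stop regulating and return to the starting level.
    void clear_goal() {
        active_ = false;
        level_ = max_level_;
    }

    std::size_t level() const { return level_; }

private:
    std::size_t max_level_;
    std::size_t level_;
    double goal_ = 0.0;
    bool active_ = false;
};
```

In an application the loop would call `regulate` once per measurement interval with the number of elements processed in that interval divided by its length in milliseconds.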

37
Power Improvement Case Study
Using Heterogeneous Compute SDK and Power Optimization SDK to Improve Power

38
Lowering Power consumption
Heterogeneous execution is key for managing power/thermal
Using more cores and lowering their frequency can get the same performance and consume less power

39
Case Study: Find all primes under 10 million
Sequential variant

40
Sequential variant
is_prime already has some optimizations, such as skipping even numbers and testing divisors only up to sqrt(n)
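The sequential code on this slide was not captured. A minimal sketch consistent with the description (trial division that skips even numbers and stops at the square root) could look like the following. The function names and exact structure are assumptions, not the presenter's code.

```cpp
#include <cstdint>

// Trial-division primality test: rejects even numbers up front and tries
// only odd divisors d with d * d <= n.
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    if (n < 4) return true;  // 2 and 3
    if (n % 2 == 0) return false;
    for (std::uint32_t d = 3; static_cast<std::uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Counts the primes strictly below limit with a plain sequential loop.
std::uint32_t count_primes_below(std::uint32_t limit) {
    std::uint32_t count = 0;
    for (std::uint32_t n = 2; n < limit; ++n) {
        if (is_prime(n)) ++count;
    }
    return count;
}
```

The cost of each test grows with n, so numbers near the top of the range take longer than numbers near the bottom. That matters for how work is divided once the loop is parallelized.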

41
Using Profiler
# of Cores
CPU Utilization
CPU Frequency
Processing Time
CPU Power

42
Sequential variant
# of Cores 1
CPU Utilization 100%
CPU Frequency Max (1.90 GHz)
Processing Time 34 seconds
CPU Power 125 mW

45
Sequential variant
# of Cores 1
CPU Power 125 mW
Processing Time 34 sec

47
What can we parallelize?

48
What can we parallelize?
Iterative loop: pfor_each can be used to parallelize it

49
Parallel variant
Simple parallel version; the work distribution between big/LITTLE cores and the chunk size could still be improved
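The parallel code on this slide was not captured. As a standard-C++ analogue of a simple parallel version, the sketch below gives each worker thread one contiguous slice of the range and sums the per-thread counts. It uses `std::thread` rather than the SDK's pfor_each. `parallel_count_primes_below` is an illustrative name.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Trial division, skipping even numbers and stopping at sqrt(n).
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    if (n < 4) return true;
    if (n % 2 == 0) return false;
    for (std::uint32_t d = 3; static_cast<std::uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Counts primes below limit, one contiguous slice of [2, limit) per worker.
std::uint32_t parallel_count_primes_below(std::uint32_t limit,
                                          unsigned workers) {
    if (workers == 0) workers = 1;
    const std::uint64_t span = limit > 2 ? limit - 2 : 0;
    const std::uint64_t chunk = (span + workers - 1) / workers;
    std::atomic<std::uint32_t> total{0};
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        const std::uint64_t begin = 2 + std::min<std::uint64_t>(w * chunk, span);
        const std::uint64_t end = std::min<std::uint64_t>(begin + chunk, limit);
        threads.emplace_back([&total, begin, end]() {
            std::uint32_t local = 0;  // count privately, publish once
            for (std::uint64_t n = begin; n < end; ++n) {
                if (is_prime(static_cast<std::uint32_t>(n))) ++local;
            }
            total += local;
        });
    }
    for (auto& t : threads) t.join();
    return total.load();
}
```

Equal-sized slices are not equal work here, because larger numbers take longer to test, so the worker with the last slice finishes last. Smaller chunks handed out dynamically would balance this better, which is the kind of work-distribution tuning the slide mentions.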

50
Parallel variant
# of Cores 8
CPU Frequency Max(1.90/2.36 GHz)
Processing Time 6.2 seconds
CPU Power 281 mW

53
Parallel variant
# of Cores 8
CPU Power 281 mW
Processing Time ~6 sec

55
Compare Sequential and Parallel variants
Metric | Sequential | Parallel
# of Cores | 1 | 8
CPU Utilization | 100% | 100%
CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz)
Processing Time | 34 sec | 6.2 sec (82% less)
CPU Power | 125 mW | 281 mW (125% more)
Percentages are relative to the sequential variant.

56
Parallel variant
Can we use the Power SDK to fine-tune power consumption?

57
Parallel variant with Power Tuning
Goal: Max Power Savings
Request big and LITTLE clusters run at 0-15% of max frequency

58
Parallel variant with Power Tuning
# of Cores 8
CPU Frequency Min(512/652 MHz)
CPU Power 82 mW

61
Parallel variant with Power Tuning
# of Cores 8
CPU Power 82 mW
Processing Time 26 sec

63
Comparison chart - Recap
Metric | Sequential | Parallel | Parallel with Power SDK (min freq)
# of Cores | 1 | 8 | 8
CPU Utilization | 100% | 100% | 100%
CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz) | Min (512/652 MHz)
Processing Time | 34 sec | 6.2 sec (82% less) | 26 sec
CPU Power | 125 mW | 281 mW (125% more) | 82 mW
Percentages are relative to the sequential variant.

64
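Comparison chart
Metric | Sequential | Parallel | Parallel with Power SDK (min freq)
# of Cores | 1 | 8 | 8
CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz) | Min (512/652 MHz)
Processing Time | 34 sec | 6.2 sec (82% less) | 26 sec (23% less)
CPU Power | 125 mW | 281 mW (125% more) | 82 mW (34% less)
Percentages are relative to the sequential variant.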

65
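Choose the optimal power-performance trade-off
Metric | Sequential | Parallel | Parallel with Power SDK (min freq)
# of Cores | 1 | 8 | 8
CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz) | Min (512/652 MHz)
Processing Time | 34 sec | 6.2 sec (82% less) | 26 sec (23% less)
CPU Power | 125 mW | 281 mW (125% more) | 82 mW (34% less)
Percentages are relative to the sequential variant.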

66
Lowering Power Consumption
Strategy for power savings
• Using more cores and lowering their frequency allows us to get the same performance with lower energy
• Choosing the right compute device is the key to lowering power
◦ big/LITTLE/GPU/DSP
• Strategy to reduce power while maintaining performance
◦ Extract parallelism
◦ Control placement of algorithm execution onto the right device
◦ Power tuning using the Power SDK

For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Thank you!
Nothing in these materials is an offer to sell any of the
components or devices referenced herein.
©2018 Qualcomm Technologies, Inc. and/or its affiliated
companies. All Rights Reserved.
Qualcomm, Snapdragon and Hexagon are trademarks of
Qualcomm Incorporated, registered in the United States
and other countries. Other products and brand names may
be trademarks or registered trademarks of their respective
owners.
References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as
applicable. Qualcomm Incorporated includes Qualcomm’s licensing
business, QTL, and the vast majority of its patent portfolio. Qualcomm
Technologies, Inc., a wholly-owned subsidiary of Qualcomm
Incorporated, operates, along with its subsidiaries, substantially all of
Qualcomm’s engineering, research and development functions, and
substantially all of its product and services businesses, including its
semiconductor business, QCT.
