2. Contents
Why 64-bit in mobile?
64-bit support in Android™ Lollipop
Where next?
Summary
3. The ARM® 64-bit Architecture: ARMv8-A
Full native 32-bit execution, side-by-side with 64-bit
New, modern A64 instruction set architecture (ISA)
Double the number (and size) of registers
New instructions for both A32 and A64
[Diagram: ARMv8-A ISA. AArch32 runs T32 + A32; AArch64 runs the A64 ISA. Crypto, Advanced SIMD and Scalar FP span both.]
4. Why 64-bit in mobile?
Performance through architecture
Cleaner instruction set architecture
Hard-float ABI by default in ARMv8-A
More registers, less stack spillage
Cheaper function calls
Up to 16x crypto acceleration
Preparation for larger memory devices
5. 64-bit support in Android Lollipop
64-bit support for ARM®
32-bit & 64-bit apps exist in the same build
Also introduces the ART runtime
Source : http://www.android.com/
6. What does L mean for developers?
Pure Java apps get ARMv8-A benefit for free via ART
32-bit NDK apps run without change, and at full performance
Rebuild NDK code with APP_ABI="arm64-v8a" to take full advantage of A64
Interworking rules mean Java apps run as 32-bit if they call 32-bit NDK code
7. What is ART?
ART is a replacement for Dalvik
Ahead-of-time (AOT) compilation, i.e. at install, instead of just-in-time (JIT)
Redesigned to be better on multi-core systems
Fits well with big.LITTLE™ technology
[Chart: ART vs Dalvik scores on Quadrant CPU and Linpack MT, relative to Dalvik JIT; axis 0% to 200%]
Measured on Nexus 7 with Dalvik/ART Preview on 4.4
8. ART on ARMv8-A: performance features
Utilizes the modern A64 ISA for 64-bit apps
Single-cycle instructions for Java long & double types
Uses hard-float ABI
32-bit object references - no 64-bit pointer penalty
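To illustrate why 32-bit object references avoid a 64-bit pointer penalty, here is a minimal sketch in C. It is not ART's implementation: it assumes a hypothetical managed heap small enough that every object can be named by a 32-bit byte offset from the heap base. The names `ref32_t`, `heap`, `to_ref` and `from_ref` are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit object reference: a byte offset from the heap base.
 * Offset 0 is reserved to mean "null". */
typedef uint32_t ref32_t;

/* A reference field costs 4 bytes whether the build is 32-bit or 64-bit. */
struct node {
    ref32_t next;   /* 32-bit reference instead of a native pointer */
    int32_t value;
};

/* Hypothetical heap of fixed-size objects; slot 0 is left unused so that
 * offset 0 can stand for null. */
static struct node heap[64];

/* Compress a native pointer into the heap to a 32-bit reference. */
static ref32_t to_ref(const struct node *p) {
    if (p == NULL) return 0;
    return (ref32_t)((const unsigned char *)p - (const unsigned char *)heap);
}

/* Expand a 32-bit reference back into a native pointer. */
static struct node *from_ref(ref32_t r) {
    if (r == 0) return NULL;
    return (struct node *)((unsigned char *)heap + r);
}
```

With a native pointer in `next`, the struct would typically grow on a 64-bit build; with `ref32_t` it stays at 8 bytes on both, at the cost of one add when a reference is followed.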
Rocket by Luis Prado from the Noun Project
9. Considerations for native developers
Porting C code to 64-bit is the same as for any other architecture
Review your feature detection code when moving to 64-bit
Assembly code needs to be ported to the more efficient A64 ISA
NEON™ code can simply be recompiled if written using compiler intrinsics
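As a hedged illustration of the feature-detection and porting points above, the sketch below relies on compiler predefined macros instead of assumptions about the target. `__aarch64__`, `__arm__` and `__ARM_NEON` are predefined by GCC and Clang (`__ARM_NEON__` is an older spelling); the function names `target_isa`, `neon_enabled` and `pointer_bits` are hypothetical.

```c
#include <stdint.h>

/* Report which ISA family this file was compiled for. */
static const char *target_isa(void) {
#if defined(__aarch64__)
    return "A64";
#elif defined(__arm__)
    return "A32/T32";
#else
    return "non-ARM";
#endif
}

/* Report whether NEON intrinsics are usable in this build.
 * Code that tested only for a 32-bit ARM macro would silently
 * take its scalar path on a 64-bit build. */
static int neon_enabled(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Keep a pointer in an integer without truncation: use uintptr_t,
 * never int, since pointers are 64 bits wide on AArch64. */
static uintptr_t pointer_bits(const void *p) {
    return (uintptr_t)p;
}
```

The same source then builds unchanged for both ABIs, and the classic 64-bit porting bug (casting a pointer through `int`) is avoided by construction.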
10. NEON Intrinsics
Include intrinsics header file (ACLE standard)
#include <arm_neon.h>
Use special NEON data types which correspond to D and Q registers, e.g.
int8x8_t     D-register    8x 8-bit values
int16x4_t    D-register    4x 16-bit values
int32x4_t    Q-register    4x 32-bit values
Use NEON intrinsics versions of instructions
vin1 = vld1q_s32(ptr);
vout = vaddq_s32(vin1, vin2);
vst1q_s32(ptr, vout);
Strongly typed!
Use vreinterpret_s16_s32( ) to change the type
Fully compatible with AArch64
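The load/add/store pattern above can be sketched as a complete function. This is an illustrative example, not code from the presentation: the name `add_s32` is hypothetical, and the scalar loop is included so the same file also builds on targets without NEON. It assumes the sums do not overflow `int32_t`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Element-wise add: out[i] = a[i] + b[i] for i in [0, n). */
static void add_s32(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four 32-bit lanes (one Q register) per iteration. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(out + i, vaddq_s32(va, vb));  /* pointer first, then value */
    }
#endif
    /* Scalar tail, and the whole array on targets without NEON. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The intrinsics section is identical for AArch32 and AArch64 builds; only the compiler target changes.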
static inline void Filter_32_opaque_neon(unsigned x, unsigned y,
                                         SkPMColor a00, SkPMColor a01,
                                         SkPMColor a10, SkPMColor a11,
                                         SkPMColor *dst) {
    uint8x8_t vy, vconst16_8, v16_y, vres;
    uint16x4_t vx, vconst16_16, v16_x, tmp;
    uint32x2_t va0, va1;
    uint16x8_t tmp1, tmp2;
    vy = vdup_n_u8(y);                 // duplicate y into vy
    vconst16_8 = vmov_n_u8(16);        // set up constant 16
    v16_y = vsub_u8(vconst16_8, vy);   // v16_y = 16-y
    va0 = vdup_n_u32(a00);             // duplicate a00
    va1 = vdup_n_u32(a10);             // duplicate a10
    va0 = vset_lane_u32(a01, va0, 1);  // set top to a01
    va1 = vset_lane_u32(a11, va1, 1);  // set top to a11
    tmp1 = vmull_u8(vreinterpret_u8_u32(va0), v16_y); // tmp1 = [...
    tmp2 = vmull_u8(vreinterpret_u8_u32(va1), vy);    // tmp2 = [...
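The excerpt weights pixel bytes by `y` and `16-y` with widening multiplies. A simplified, self-contained sketch of that idea follows. It is not the Skia routine: the function name `blend_rows_u8` and its row-of-eight-bytes interface are assumptions made for illustration, and `y` is assumed to be in the range 0 to 16.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Weighted sum of two rows of 8 bytes:
 *   out[i] = top[i] * (16 - y) + bot[i] * y,  with 0 <= y <= 16.
 * The result needs 16 bits per lane (maximum 255 * 16 = 4080). */
static void blend_rows_u8(const uint8_t top[8], const uint8_t bot[8],
                          unsigned y, uint16_t out[8]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t vy    = vdup_n_u8((uint8_t)y);         /* y in every lane   */
    uint8x8_t v16_y = vsub_u8(vdup_n_u8(16), vy);    /* 16 - y            */
    uint16x8_t acc  = vmull_u8(vld1_u8(top), v16_y); /* widen: 8 -> 16 bit */
    acc = vmlal_u8(acc, vld1_u8(bot), vy);           /* acc += bot * y    */
    vst1q_u16(out, acc);
#else
    /* Portable fallback with the same arithmetic. */
    for (int i = 0; i < 8; ++i) {
        out[i] = (uint16_t)(top[i] * (16u - y) + bot[i] * y);
    }
#endif
}
```

Because the types are strict, mixing `uint8x8_t` and `uint16x8_t` by mistake is a compile error; a `vreinterpret` call is needed wherever the same bits must be viewed as a different lane type.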
11. Compatibility
C/intrinsics will port with no effort
Asm requires reworking of .s file (mostly cosmetic, but can take advantage of additional registers)
AArch64 NEON optimization in progress
ARM & Linaro working on key Android libraries using intrinsics
ffmpeg AArch64 NEON decoders (asm)
x264 AArch64 NEON encoder (asm)
AArch64 NEON coding technique    Compatible?
Vectorized “C”                   Fully compatible
Intrinsics (“arm_neon.h”)        Fully compatible
Asm (.s)                         Some porting required
Library routines                 Yes, if library available
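For the vectorized "C" row, a minimal sketch of a loop written so that a compiler can auto-vectorize it is shown below. The function name `scale_add` is hypothetical, and whether NEON instructions are actually emitted depends on the compiler and optimization flags; the source itself needs no changes between AArch32 and AArch64.

```c
#include <stddef.h>

/* dst[i] += k * src[i] for i in [0, n).
 * 'restrict' promises the arrays do not overlap, which removes the
 * aliasing doubt that often stops a compiler from vectorizing. */
static void scale_add(float *restrict dst, const float *restrict src,
                      float k, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}
```

Keeping the loop body free of branches and function calls gives the vectorizer the best chance; when it does not vectorize, intrinsics remain the portable next step.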
12. Performance – Native
[Chart: AArch64 improvement over AArch32, AnTuTu 32/64bit CPU Test v5.0, Single Thread and Multithreaded; axis 0% to 30%]
[Chart: AArch64 improvement over AArch32, bionic-benchmarks (bionic); axis 0% to 30%]
Measured on Juno (2x Cortex-A57, 4x Cortex-A53)
13. Performance – ART
Quadrant 2.0
[Chart: AArch64 improvement over AArch32, CPU Score; axis 0% to 30%]
[Chart: AArch64 improvement over AArch32, Linpack Multi-threaded; axis 0% to 30%]
Measured on Juno (2x Cortex-A57, 4x Cortex-A53)
14. Want to know more?
Join us in the Android group at Connected Community!
http://community.arm.com/groups/android-community
ARMv8-A Porting Guide:
http://community.arm.com/docs/DOC-8453
Taming ARMv8-A NEON: from theory to benchmark results
http://youtu.be/ixuDntaSnHI?list=UUIVqQKxCyQLJS6xvSmfndLA
Porting & optimizing for 64-bit, a compiler perspective
http://www.linaro.org/assets/common/campus-party-presentation-Sept_2013.pdf
https://www.youtube.com/watch?v=epzYErIIx0Y
An OS X perspective of the 32-64-bit transition
https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/64bitPorting/intro/intro.html
15. Summary
The ARMv8-A architecture makes the difference for mobile and 64-bit
Android Lollipop provides multi-arch support enabling both 32-bit and 64-bit applications
Performance gains for those taking advantage of the ARMv8-A architecture
Come join us at Connected Community
16. Thank You
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners.
The Android robot is reproduced or modified from work created and shared by Google and used according to terms described in the Creative Commons 3.0 Attribution License.
Google Play is a trademark of Google Inc.