
Automatic Vectorization in ART (Android RunTime) - SFO17-216


Session ID: SFO17-216
Session Name: Automatic Vectorization in ART (Android RunTime) - SFO17-216
Speakers: Aart Bik, Artem Serov
Track: LMG


★ Session Summary ★
Because all modern general-purpose CPUs support small-scale SIMD
instructions (typically between 64-bit and 512-bit), modern compilers
are becoming progressively better at taking advantage of SIMD
instructions automatically, a translation often referred to as
vectorization or SIMDization. Since the Android O release, the
optimizing compiler of ART has joined the family of vectorizing
compilers with the ability to translate bytecode into native SIMD code
for the target Android device. This talk will discuss the general
organization of the retargetable part of the vectorizer, which is
capable of automatically finding and exploiting vector instructions in
bytecode without yet committing to one of the target SIMD
architectures (currently ARM NEON (Advanced SIMD), x86 SSE, and the
MIPS SIMD Architecture). Furthermore, the talk will present details of
deploying the vectorizing compiler on ARM platforms - its overall
impact on performance and some ARM-specific considerations and
optimizations - and will give an update on the Linaro ART team's
SIMD-related activities.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/sfo17/sfo17-216/
Presentation:
Video: https://www.youtube.com/watch?v=KOD5D_DjzaI
---------------------------------------------------

★ Event Details ★
Linaro Connect San Francisco 2017 (SFO17)
25-29 September 2017
Hyatt Regency San Francisco Airport

---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://twitter.com/linaroorg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961



  1. ENGINEERS AND DEVICES WORKING TOGETHER
  4. K (Android 4.4): Dalvik + JIT compiler; L (Android 5.0): ART + AOT compiler; M (Android 6.0): ART + AOT compiler; N (Android 7.0): ART + JIT/AOT compiler; O (Android 8.0): ART + JIT/AOT compiler + vectorization
  7. A SIMD instruction performs a single operation on multiple operands in parallel (e.g. 4x32-bit operations). ARM: NEON Technology (128-bit); Intel: SSE* (128-bit), AVX* (256-bit, 512-bit); MIPS: MSA (128-bit). All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit).
  9. Many vectorizing compilers were developed by supercomputer vendors; Intel introduced the first vectorizing compiler for SSE in 1999. Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers. (www.aartbik.com)
  11. for (int i = 0; i < 256; i++) { a[i] = b[i] + 1; }  ->  for (int i = 0; i < 256; i += 4) { a[i:i+3] = b[i:i+3] + [1,1,1,1]; }
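The transformation on this slide can be mimicked in plain Java: the scalar loop on the left, and a hand-written 4-lane version that processes the same data the way a 4x32-bit SIMD add would. This is an illustrative sketch only (class and method names are my own, not ART output):

```java
import java.util.Arrays;

public class VectorizeSketch {
    // The scalar loop from the slide.
    static void scalar(int[] a, int[] b) {
        for (int i = 0; i < 256; i++) {
            a[i] = b[i] + 1;
        }
    }

    // One "vector" iteration covers four lanes, like a single 4x32-bit SIMD add.
    static void vectorized4(int[] a, int[] b) {
        for (int i = 0; i < 256; i += 4) {
            a[i]     = b[i]     + 1;
            a[i + 1] = b[i + 1] + 1;
            a[i + 2] = b[i + 2] + 1;
            a[i + 3] = b[i + 3] + 1;
        }
    }

    public static void main(String[] args) {
        int[] b = new int[256];
        for (int i = 0; i < 256; i++) b[i] = i;
        int[] a1 = new int[256], a2 = new int[256];
        scalar(a1, b);
        vectorized4(a2, b);
        System.out.println(Arrays.equals(a1, a2)); // both loops produce identical results
    }
}
```

The vectorizer performs exactly this reshaping on HIR, which is legal here because the iterations are independent and the trip count (256) is a multiple of the vector length.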
  12. A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures: VectorOperation (has vector length, has packed data type), with subclasses VectorBinOp (VectorAdd, VectorSub, ...) and VectorMemOp (has alignment; VectorLoad, VectorStore, ...).
  13. for (int i = 0; i < 256; i += 4) { a[i:i+3] = b[i:i+3] + [1,1,1,1]; }  ->  t = [1,1,1,1]; for (int i = 0; i < 256; i += 8) { a[i:i+3] = b[i:i+3] + t; a[i+4:i+7] = b[i+4:i+7] + t; }
  14. t = [1,1,1,1]; for (int i = 0; i < 256; i += 8) { a[i:i+3] = b[i:i+3] + t; a[i+4:i+7] = b[i+4:i+7] + t; }  ->
      movi v0.4s, #0x1, lsl #0 / mov w3, #0xc / mov w0, #0x0 / Loop: cmp w0, #0x100 (256) / b.hs Exit / add w4, w0, #0x4 (4) / add w0, w3, w0, lsl #2 / add w5, w3, w4, lsl #2 / ldr q1, [x2, x0] / add v1.4s, v1.4s, v0.4s / str q1, [x1, x0] / ldr q1, [x2, x5] / add v1.4s, v1.4s, v0.4s / str q1, [x1, x5] / add w0, w4, #0x4 (4) / ldrh w16, [tr] ; suspend check / cbz w16, Loop
  15. VecReplicateScalar(x): ARM64: dup v0.4s, w2 | x86-64: movdq xmm0, rdx + pshufd xmm0, xmm0, 0 | MIPS64: fill.w w0, a2
  16. /** Cross-fade byte arrays x1 and x2 into byte array x_out. */
      private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
        // Compute minimum length of the three byte arrays.
        int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
        // Morph with rounding halving add (unsigned).
        for (int i = 0; i < min; i++) {
          x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
        }
      }
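The idiom `((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1` is what ART pattern-matches into the NEON unsigned rounding halving add (urhadd) on the next slide. A small runnable driver (the class name and test values are mine, not from the talk) showing the unsigned rounding behavior on the corner cases:

```java
public class CrossFade {
    // The avg() method from the slide (made package-private for the driver).
    static void avg(byte[] x_out, byte[] x1, byte[] x2) {
        int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
        for (int i = 0; i < min; i++) {
            // Mask with 0xff to treat the bytes as unsigned; +1 rounds up.
            x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
        }
    }

    public static void main(String[] args) {
        byte[] a = { (byte) 0xff, 0, 10 };
        byte[] b = { (byte) 0xff, 1, 11 };
        byte[] out = new byte[3];
        avg(out, a, b);
        // (255+255+1)>>1 = 255 ; (0+1+1)>>1 = 1 (rounds up) ; (10+11+1)>>1 = 11
        System.out.println((out[0] & 0xff) + " " + (out[1] & 0xff) + " " + (out[2] & 0xff)); // 255 1 11
    }
}
```

Note that without the `& 0xff` masks the bytes would sign-extend and the expression would no longer match the unsigned halving-add pattern.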
  17. SEQUENTIAL (ARMv8 AArch64): L: cmp w5, w0 / b.hs Exit / add w4, w2, #0xc (12) / add w6, w3, #0xc (12) / ldrsb w4, [x4, x5] / ldrsb w6, [x6, x5] / and w4, w4, #0xff / and w6, w6, #0xff / add w4, w4, w6 / add w6, w1, #0xc (12) / add w4, w4, #0x1 (1) / asr w4, w4, #1 / strb w4, [x6, x5] / add w5, w5, #0x1 (1) / ldrh w16, [tr] ; suspend check / cbz w16, L
      SIMD (ARMv8 AArch64 + NEON Technology): L: cmp w5, w4 / b.hs Exit / add w16, w2, w5 / ldur q0, [x16, #12] / add w16, w3, w5 / ldur q1, [x16, #12] / urhadd v0.16b, v0.16b, v1.16b / add w16, w1, w5 / stur q0, [x16, #12] / add w5, w5, #0x10 (16) / ldrh w16, [tr] ; suspend check / cbz w16, L
      Runs about 10x faster!
  18. Sequential performance: ≈20fps. SIMD performance (NEON 128-bit): ≈60fps.
  20. Java code: void mul_add(int[] a, int[] b) { for (int i = 0; i < 512; i++) { a[i] += a[i] * b[i]; } }
  21. Java code: void mul_add(int[] a, int[] b) { for (int i = 0; i < 512; i++) { a[i] += a[i] * b[i]; } }
      Autovectorization result: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.2s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.2s}, [x16] ; mul v1.2s, v0.2s, v1.2s ; add v0.2s, v0.2s, v1.2s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v0.2s}, [x16] ; add w0, w0, #0x2 ; ldrh w16, [tr] ; cbz w16, L
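This mul_add kernel is the running example for the ARM-specific optimizations on the following slides. A runnable sketch of its semantics (the class name and driver values are mine, not from the talk); note that a[i] += a[i] * b[i] maps directly onto the multiply-accumulate (mla) instruction seen later:

```java
public class MulAdd {
    // The kernel from the slide: a[i] = a[i] + a[i] * b[i], i.e. a[i] * (1 + b[i]).
    static void mul_add(int[] a, int[] b) {
        for (int i = 0; i < 512; i++) {
            a[i] += a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[512], b = new int[512];
        for (int i = 0; i < 512; i++) { a[i] = i; b[i] = 2; }
        mul_add(a, b);
        System.out.println(a[3]); // 3 + 3*2 = 9
    }
}
```

Because each element of a[] is both read and written with no cross-iteration dependence, the loop is trivially vectorizable; the constant trip count (512) also lets the compiler skip runtime cleanup-loop logic.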
  22. Before -> After (68% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.2s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.2s}, [x16] ; mul v1.2s, v0.2s, v1.2s ; add v0.2s, v0.2s, v1.2s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v0.2s}, [x16] ; add w0, w0, #0x2 ; ldrh w16, [tr] ; cbz w16, L
      After: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.4s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.4s}, [x16] ; mul v1.4s, v0.4s, v1.4s ; add v0.4s, v0.4s, v1.4s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v0.4s}, [x16] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  25. Before -> After (11% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.4s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.4s}, [x16] ; mul v1.4s, v0.4s, v1.4s ; add v0.4s, v0.4s, v1.4s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v0.4s}, [x16] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
      After: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.4s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.4s}, [x16] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v2.4s}, [x16] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  28. Before -> After (23% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.4s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.4s}, [x16] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v2.4s}, [x16] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
      After: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, w0, lsl #2 ; ldur q0, [x16, #12] ; add w16, w2, w0, lsl #2 ; ldur q1, [x16, #12] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; add w16, w1, w0, lsl #2 ; stur q2, [x16, #12] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  31. Before -> After (10% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, w0, lsl #2 ; ldur q0, [x16, #12] ; add w16, w2, w0, lsl #2 ; ldur q1, [x16, #12] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; add w16, w1, w0, lsl #2 ; stur q2, [x16, #12] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
      After: mov w3, #0xc ; L: cmp w0, #0x200 ; b.hs Exit ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  35. Before -> After (2.5% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
      After: L: cmp w0, #0x200 ; b.hs Exit ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  38. Before -> After (12% perf boost):
      Before: L: cmp w0, #0x200 ; b.hs Exit ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; add w4, w3, w0, lsl #2 ; ldr q0, [x1, x4] ; ldr q1, [x2, x4] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x4] ; add w0, w0, #0x4 ; ldrh w16, [tr] ; cbz w16, L
      After: L: cmp w0, #0x200 ; b.hs Exit ; add w4, w0, #0x4 ; add w0, w3, w0, lsl #2 ; add w5, w3, w4, lsl #2 ; ldr q0, [x1, x0] ; ldr q1, [x2, x0] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x0] ; ldr q0, [x1, x5] ; ldr q1, [x2, x5] ; mov v2.16b, v0.16b ; mla v2.4s, v0.4s, v1.4s ; str q2, [x1, x5] ; add w0, w4, #0x4 ; ldrh w16, [tr] ; cbz w16, L
  42. for (int i = 0; i < LENGTH; i++) { c[i] = (byte)(a[i] + b[i]); }
      HIR: i87 Add [i80,i79] ; i102 IntermediateAddressIndex [i87,i98,i3] ; i99 IntermediateAddressIndex [i80,i98,i3] ; d89 VecLoad [l35,i102] ; d84 VecLoad [l35,i99] ; d83 VecLoad [l29,i99] ; d88 VecLoad [l29,i102] ; d85 VecAdd [d83,d84] ; d90 VecAdd [d88,d89] ; d86 VecStore [l27,i99,d85] ; d91 VecStore [l27,i102,d90] ; i92 Add [i87,i79] ; v78 Goto
  43. Java Code: static final int LENGTH = 1024 * 256; // 256K elements, 0x40000
      static byte [] a = new byte[LENGTH]; static byte [] b = new byte[LENGTH]; static byte [] c = new byte[LENGTH];
      (gdb) x/64u 0xefc0b000
      0xefc0b000: 0 28 192 18 0 0 0 0
      0xefc0b008: 0 0 4 0 100 101 102 103
      0xefc0b010: 104 105 106 107 108 109 110 111
      0xefc0b018: 112 113 114 115 116 117 118 119
      0xefc0b020: 120 121 122 123 124 125 126 127
      0xefc0b028: 128 129 130 131 132 133 134 135
      0xefc0b030: 136 137 138 139 140 141 142 143
      0xefc0b038: 144 145 146 147 148 149 150 151
      Object Header | data[0]
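The dump above explains the #0xc (12) offsets in the generated code: on this 32-bit ART heap the array object appears to consist of an 8-byte object header plus a 4-byte length field, so data[0] sits at offset 12. A small check (class name mine; the layout interpretation is an assumption read off the slide's "Object Header / data[0]" annotation) that the length bytes "0 0 4 0" in the dump are the little-endian encoding of LENGTH:

```java
public class HeaderCheck {
    public static void main(String[] args) {
        int LENGTH = 1024 * 256; // 256K elements, as declared in the Java code
        // The four bytes at offset 8 in the dump, decoded little-endian.
        int[] bytes = {0, 0, 4, 0};
        int decoded = bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | (bytes[3] << 24);
        System.out.println(decoded == LENGTH);           // true: 0x00040000 = 262144
        System.out.println(Integer.toHexString(LENGTH)); // 40000
    }
}
```

This 12-byte offset is also why the array data is not 16-byte aligned, which motivates the alignment discussion on the next slides.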
  44. One VecLoad / VecStore (same heap dump as slide 43).
  45. (Same heap dump as slide 43, annotated:) SIMD from here -> ... Avoid SIMD from here
  49. Analyzable and flexible: CHECKED! Embeddable: CHECKED! Stable and reproducible: CHECKED! Recognized: CHECKED!
  55. LDR q1, [x16] + LDR q2, [x16, #16] -> LDP q1, q2, [x16]
  57. Java: void mul_add(int[] a, int[] b, int[] c) { for (int i=0; i<512; i++) { a[i] += a[i] * b[i]; } }
      Scalar version: L: cmp w0, #0x200 ; b.hs Exit ; add w4, w1, #0xc ; ldr w6, [x4, x0, lsl #2] ; add w5, w2, #0xc ; ldr w5, [x5, x0, lsl #2] ; madd w5, w6, w5, w6 ; str w5, [x4, x0, lsl #2] ; add w0, w0, #0x1 ; ldrh w16, [tr] ; cbz w16, L
      Initial SIMD version: L: cmp w0, #0x200 ; b.hs Exit ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v0.2s}, [x16] ; add w16, w2, #0xc ; add x16, x16, x0, lsl #2 ; ld1 {v1.2s}, [x16] ; mul v1.2s, v0.2s, v1.2s ; add v0.2s, v0.2s, v1.2s ; add w16, w1, #0xc ; add x16, x16, x0, lsl #2 ; st1 {v0.2s}, [x16] ; add w0, w0, #0x2 ; ldrh w16, [tr] ; cbz w16, L
