
Optimizing GoLang for High Performance with ARM64 Assembly - SFO17-314

Session ID: SFO17-314
Session Name: Optimizing GoLang for High Performance with ARM64 Assembly - SFO17-314
Speaker: Wei Xiao - Fannie Zhang
Track: LEG


★ Session Summary ★
This session is a guide to Go assembly on ARM64. It introduces how to write an ARM64 Go assembly program and describes the key Go-specific details for ARM64. With a working knowledge of Go assembly, we can apply optimizations that improve performance. We also showcase an example of SHA optimization.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/sfo17/sfo17-314/
Video: https://www.youtube.com/watch?v=Q_pmdO7sFC4
---------------------------------------------------

★ Event Details ★
Linaro Connect San Francisco 2017 (SFO17)
25-29 September 2017
Hyatt Regency San Francisco Airport

---------------------------------------------------
Keyword:
'http://www.linaro.org'
'http://connect.linaro.org'
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://twitter.com/linaroorg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961

Published in: Technology

  1. SFO17-314: Optimizing Golang for High Performance with ARM64 Assembly
     Wei Xiao, Staff Software Engineer, Wei.Xiao@arm.com
     September 27, 2017, Linaro Connect SFO17
     © 2017 Arm Limited
  2. Agenda
     • Introduction
     • Differences from GNU Assembly
     • Integrate assembly into Golang
     • Optimize CRC32 for arm64
     • Optimize SHA256 for arm64
     • Optimize IndexByte for arm64
     • Work Summary and Next steps
  3. Introduction
     • Assembly optimization benefits
     • Take advantage of ARMv8 capabilities
       – Hardware-specific instructions (such as SVC, AES, SHA, etc.)
       – Vector (Single Instruction Multiple Data) instructions
     • Others
       – No need for CGo dependency
       – Avoid runtime context switching overhead
       – Optimized code (vs Go compiler)
       – Faster compilation
  4. Assembly Optimization Current Status
     • Go standard packages with assembly optimization:
       crypto/aes, crypto/elliptic, crypto/internal/cipherhw, crypto/md5, crypto/rc4, crypto/sha1, crypto/sha256, crypto/sha512, hash/crc32, math, math/big, reflect, runtime, runtime/cgo, runtime/internal/atomic, runtime/internal/sys, strings, sync/atomic, syscall, ……
     • Legend: red – arm64 optimization ongoing; black – no arm64 optimization
  5. Assembly Terminology
     • Mnemonic: CALL, MOVW, MOVD, …
     • Register: R1, F0, V3, …
     • Immediate: $1, $0x100, …
     • Memory: (R1), 8(R3), …
     [Figure: Registers in AArch64]
  6. Instruction Differences from GNU Assembly
     • Semi-abstract instruction set (Plan 9 from Bell Labs)
     • Architecture-independent mnemonics like MOVD
     • Some architecture aspects shine through
     • Assembler may insert prologues, remove ‘unreachable’ instructions
     • Instructions may be expanded by the assembler
     • Not all instructions available
     • BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly

         // func Add(a, b int) int
         TEXT ·Add(SB),$0-24
             MOVD arg1+0(FP), R0   // first argument
             MOVD arg2+8(FP), R1   // second argument
             ADD  R1, R0, R0       // R0 = R0 + R1
             MOVD R0, ret+16(FP)   // store the result
             RET
  7. Operand Differences from GNU Assembly
     • Data flow from left to right
       • ADD R1, R2 → R2 += R1
       • SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
     • Memory operands: base + offset
       • MOVH (R1), R2 → R2 = *R1
       • MOVBU 8(R3), R4 → R4 = *(8 + R3)
       • MOVD mypackage·myvar(SB), R8 → R8 = myvar
     • Addresses
       • MOVD $8(R1), R3 → R3 = R1 + 8
       • MOVD $·myvar(SB), R4 → R4 = &myvar
     • The examples assume: package mypackage; var myvar int64
     • The middle dot in symbol names is Unicode U+00B7
  8. Go Assembly Extension for arm64
     • Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
     • Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
     • Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
     • Register list, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
     • Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
     • Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
     • Full details: https://go-review.googlesource.com/c/go/+/41654
  9. Assembly Build Rule
     • The toolchain selects the appropriate assembly files according to GOOS+GOARCH
     • Selection uses file name suffixes, e.g.
       • sys_linux_arm64.s
       • sys_darwin_arm64.s
     • Example: assembly files for hash/crc32
       • crc32_amd64p32.s
       • crc32_amd64.s
       • crc32_arm64.s
       • crc32_ppc64le.s, crc32_table_ppc64le.s
       • crc32_s390x.s
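The suffix rule on this slide can be sketched in plain Go. This is a simplified model, not the toolchain's actual code: `matchesTarget` is a hypothetical helper, and the `knownOS` and `knownArch` tables are small illustrative subsets of the values the real toolchain recognizes.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subsets only; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true}

// matchesTarget (hypothetical helper) reports whether the _GOOS/_GOARCH
// suffix of an assembly file name allows it to be built for goos/goarch.
func matchesTarget(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	switch {
	case n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]]:
		// name_GOOS_GOARCH.s: both must match.
		return parts[n-2] == goos && parts[n-1] == goarch
	case n >= 2 && knownArch[parts[n-1]]:
		// name_GOARCH.s: only the architecture is constrained.
		return parts[n-1] == goarch
	case n >= 2 && knownOS[parts[n-1]]:
		// name_GOOS.s: only the operating system is constrained.
		return parts[n-1] == goos
	}
	// No recognized suffix: the name imposes no constraint.
	return true
}

func main() {
	files := []string{
		"crc32_amd64p32.s", "crc32_amd64.s", "crc32_arm64.s",
		"crc32_ppc64le.s", "crc32_table_ppc64le.s", "crc32_s390x.s",
	}
	for _, f := range files {
		if matchesTarget(f, "linux", "arm64") {
			fmt.Println("selected for linux/arm64:", f)
		}
	}
}
```

Note that in `crc32_table_ppc64le.s` the word `table` is not an operating system name, so only the trailing `ppc64le` acts as a constraint.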
  10. Prototype
      • Function call is the bridge between Go and assembly
      • Function declaration
        • src/runtime/timestub.go
        • func walltime() (sec int64, nsec int32)
      • Function assembly implementation
        • runtime/sys_linux_arm64.s
      [Figure labels on the assembly declaration: package (optional), function name, flag (optional), stack frame size, arguments size (optional), middle dot]
  11. Pseudo-registers
      • FP: Frame Pointer
        • Points to the bottom of the argument list
        • Offsets are positive
        • Offsets must include a name, e.g. arg+0(FP)
      • SP: Stack Pointer
        • Points to the top of the space allocated for local variables
        • Offsets are negative
        • Offsets must include a name, e.g. ptr-8(SP)
      • SB: Static Base
        • Named offsets from a global base
      [Figures: two stack diagrams, each drawn from low address to high address]
  12. Calling Convention
      • All arguments are passed on the stack
        • Offsets from FP
      • Return arguments follow input arguments
        • Start of return arguments aligned to pointer size
      • All registers are caller saved, except:
        • Stack pointer register (RSP)
        • G context pointer register (R28)
        • Frame pointer (R29)
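The argument sizes that appear in the `TEXT` directives of this deck (`$0-24` for `Add`, `$0-36` for `castagnoliUpdate`) follow from this convention. The sketch below reproduces them with `unsafe.Offsetof` on structs that mirror the argument areas. It assumes a 64-bit target, and the struct types and helper functions are hypothetical names used only for illustration; Go struct layout is an analogy that happens to agree with the stack layout for these two signatures, not a general rule.

```go
package main

import (
	"fmt"
	"unsafe"
)

// addFrame mirrors the argument area of: func Add(a, b int) int
type addFrame struct {
	a, b int
	ret  int
}

// crcFrame mirrors: func castagnoliUpdate(crc uint32, p []byte) uint32
type crcFrame struct {
	crc uint32
	p   []byte // slice header: base pointer, len, cap (three words)
	ret uint32
}

// addArgsSize returns the offset just past the last result of Add.
func addArgsSize() uintptr {
	var f addFrame
	return unsafe.Offsetof(f.ret) + unsafe.Sizeof(f.ret)
}

// crcRetOffset returns the FP offset of the result of castagnoliUpdate.
func crcRetOffset() uintptr {
	var f crcFrame
	return unsafe.Offsetof(f.ret)
}

// crcArgsSize returns the offset just past the result of castagnoliUpdate.
// The trailing struct padding is deliberately not counted.
func crcArgsSize() uintptr {
	var f crcFrame
	return unsafe.Offsetof(f.ret) + unsafe.Sizeof(f.ret)
}

func main() {
	var f crcFrame
	fmt.Println("Add arguments size:", addArgsSize())
	fmt.Println("castagnoliUpdate p offset:", unsafe.Offsetof(f.p))
	fmt.Println("castagnoliUpdate ret offset:", crcRetOffset())
	fmt.Println("castagnoliUpdate arguments size:", crcArgsSize())
}
```

On a 64-bit target the `crc` argument occupies only 4 bytes, but the slice that follows starts at offset 8 because of its alignment, which is why the assembly reads `p+8(FP)` and `p_len+16(FP)`.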
  13. arm64 Stack Frame
      [Figure: stack frame layout without and with frame pointer, drawn from low address to high address]
  14. Optimize CRC32 for arm64 – Before
      • Pure Go table-driven implementation
      src/hash/crc32/crc32_generic.go

          func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
              crc = ^crc
              for _, v := range p {
                  // One table lookup per input byte.
                  crc = tab[byte(crc)^v] ^ (crc >> 8)
              }
              return ^crc
          }
  15. Optimize CRC32 for arm64 – After
      • Assembly for arm64
      src/hash/crc32/crc32_arm64.s

          // func castagnoliUpdate(crc uint32, p []byte) uint32
          TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
              MOVWU   crc+0(FP), R9      // CRC value
              MOVD    p+8(FP), R13       // data pointer
              MOVD    p_len+16(FP), R11  // len(p)

              CMP     $8, R11
              BLT     less_than_8

          update:
              MOVD.P  8(R13), R10
              CRC32CX R10, R9
              SUB     $8, R11

              CMP     $8, R11
              BLT     less_than_8

              JMP     update
              …
          done:
              MOVWU   R9, ret+32(FP)
              RET

      [Figure: argument frame with crc at 0(FP), p.base at 8(FP), p.len at 16(FP), p.cap at 24(FP), ret at 32(FP)]
  16. Optimize CRC32 for arm64 – Result
      • Optimization with assembly
      • 2X-7X speedup
  17. Optimize SHA256 for arm64
      • SHA256 introduction
      [Table: column headers block, rounds, K, Hash; row SHA-256 with values 512bits, 64, 32bits, 32bits, 256bits]
  18. Optimize SHA256 for arm64 – Message schedule
      src/crypto/sha256/sha256block.go

          for i := 0; i < 16; i++ {
              j := i * 4
              w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
          }
          for i := 16; i < 64; i++ {
              v1 := w[i-2]
              t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
              v2 := w[i-15]
              t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
              w[i] = t1 + w[i-7] + t2 + w[i-16]
          }

      With the arm64 SHA256 instructions (pseudocode, four words per iteration):

          for i := 16; i < 64; i+=4 {
              SHA256SU0 Vn.S4, Vd.S4
              SHA256SU1 Vm.S4, Vn.S4, Vd.S4
          }
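The message schedule on this slide can be run on its own. The following is a minimal sketch, not the standard library's code: `schedule` and `paddedABC` are hypothetical names, the rotations are written with `math/bits.RotateLeft32` (a negative count rotates right), and the input is the single padded block for the three-byte message "abc".

```go
package main

import (
	"fmt"
	"math/bits"
)

// schedule expands one 64-byte block into the 64-word message schedule,
// following the two loops shown on the slide.
func schedule(p *[64]byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		j := i * 4
		w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
	}
	for i := 16; i < 64; i++ {
		v1 := w[i-2]
		t1 := bits.RotateLeft32(v1, -17) ^ bits.RotateLeft32(v1, -19) ^ (v1 >> 10)
		v2 := w[i-15]
		t2 := bits.RotateLeft32(v2, -7) ^ bits.RotateLeft32(v2, -18) ^ (v2 >> 3)
		w[i] = t1 + w[i-7] + t2 + w[i-16]
	}
	return w
}

// paddedABC returns the single padded SHA-256 block for the message "abc":
// the message, a 0x80 byte, zero fill, and the bit length in the last bytes.
func paddedABC() [64]byte {
	var b [64]byte
	copy(b[:], "abc")
	b[3] = 0x80
	b[63] = 24 // message length in bits
	return b
}

func main() {
	block := paddedABC()
	w := schedule(&block)
	for i := 0; i < 20; i++ {
		fmt.Printf("w[%2d] = %08x\n", i, w[i])
	}
}
```

Each `w[i]` for `i >= 16` depends on four earlier words, which is the dependency pattern that `SHA256SU0` and `SHA256SU1` evaluate for four words at a time.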
  19. Optimize SHA256 for arm64 – Hash Computation
      src/crypto/sha256/sha256block.go

          for i := 0; i < 64; i++ {
              t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

              t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

              h = g
              g = f
              f = e
              e = d + t1
              d = c
              c = b
              b = a
              a = t1 + t2
          }

      With the arm64 SHA256 instructions (pseudocode, four rounds per iteration):

          for i := 0; i < 64; i+=4 {
              SHA256H  Vm, Vn, Vd.4S
              SHA256H2 Vm, Vn, Vd.4S
          }
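A single iteration of the round loop can be isolated as a pure function, which makes the data movement easier to see. This is an illustrative sketch: the `state` type and `round` function are hypothetical names, the round constant and schedule word are passed in as parameters rather than read from `_K` and `w`, and the values in `main` are the SHA-256 initial hash values with the first round constant and the first schedule word of the padded message "abc".

```go
package main

import (
	"fmt"
	"math/bits"
)

// state holds the eight working variables a, b, c, d, e, f, g, h.
type state [8]uint32

// round applies one SHA-256 round with round constant k and schedule word w.
func round(s state, k, w uint32) state {
	a, b, c, d, e, f, g, h := s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]

	t1 := h +
		(bits.RotateLeft32(e, -6) ^ bits.RotateLeft32(e, -11) ^ bits.RotateLeft32(e, -25)) +
		((e & f) ^ (^e & g)) + k + w
	t2 := (bits.RotateLeft32(a, -2) ^ bits.RotateLeft32(a, -13) ^ bits.RotateLeft32(a, -22)) +
		((a & b) ^ (a & c) ^ (b & c))

	// Every variable shifts down one slot; only a and e receive new values.
	return state{t1 + t2, a, b, c, d + t1, e, f, g}
}

func main() {
	// SHA-256 initial hash values.
	s := state{
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
	}
	// First round constant and first schedule word for the message "abc".
	s = round(s, 0x428a2f98, 0x61626380)
	fmt.Printf("%08x\n", s)
}
```

Only two of the eight variables are recomputed per round, so the state splits naturally into two four-word halves, matching the `.4S` vector operands on the slide.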
  20. Optimize SHA256 for arm64 – Implementation
      src/crypto/sha256/sha256block_arm64.s
  21. Optimize SHA256 for arm64 – Result
      • Optimization with assembly
      • 2X-16X speedup
  22. Optimize IndexByte for arm64 – Before
      [Figure: the bytes H E L L O W O R L D … searched for the byte D, annotated with registers R0, R1, R2]
      src/runtime/asm_arm64.s
  23. Optimize IndexByte for arm64 – After
      • Assembly implementation with SIMD
      • SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
        • Compare 16 bytes in parallel
      • More details:
        • Input slice shorter than 16
        • Input slice address not 16-byte aligned
        • Input slice size not 16-byte aligned
        • Count trailing zeros (not leading zeros)
      • Implementation:
        • https://go-review.googlesource.com/c/go/+/41654
  24. Optimize IndexByte for arm64 – Result
      • Optimization with SIMD
      • 1.5X-8X speedup
  25. Work Summary
      • Disassembler (arm64):
        • https://go-review.googlesource.com/c/arch/+/43651
        • https://go-review.googlesource.com/c/arch/+/56810
        • https://go-review.googlesource.com/c/go/+/58930
        • https://go-review.googlesource.com/c/go/+/56331
        • https://go-review.googlesource.com/c/go/+/49530
      • Assembler (arm64):
        • https://go-review.googlesource.com/c/go/+/33594
        • https://go-review.googlesource.com/c/go/+/33595
        • https://go-review.googlesource.com/c/go/+/41511
        • https://go-review.googlesource.com/c/go/+/41654
        • https://go-review.googlesource.com/c/go/+/45850
        • https://go-review.googlesource.com/c/go/+/54951
        • https://go-review.googlesource.com/c/go/+/54990
        • https://go-review.googlesource.com/c/go/+/57852
        • https://go-review.googlesource.com/c/go/+/58350
        • https://go-review.googlesource.com/c/go/+/56030
        • https://go-review.googlesource.com/c/go/+/46438
        • https://go-review.googlesource.com/c/go/+/41653
      • Optimizations:
        • https://go-review.googlesource.com/c/go/+/40074
        • https://go-review.googlesource.com/c/go/+/61550
        • https://go-review.googlesource.com/c/go/+/61570
        • https://go-review.googlesource.com/c/go/+/33597
        • https://go-review.googlesource.com/c/go/+/64490
        • https://go-review.googlesource.com/c/go/+/55610
      • Others:
        • https://go-review.googlesource.com/c/go/+/61511
        • https://go-review.googlesource.com/c/go/+/62850
        • https://go-review.googlesource.com/c/go/+/45112
        • https://go-review.googlesource.com/c/go/+/44390
        • https://go-review.googlesource.com/c/go/+/42971
        • https://go-review.googlesource.com/c/go/+/40511
        • https://go-review.googlesource.com/c/arch/+/37172
  26. Next Steps
      • Crypto optimizations:
        • aes, elliptic, …
      • SIMD optimizations:
        • strings, bytes, runtime, reflect, …
      • Compiler SSA arm64 back-end optimizations
      • Others
        • Internal arm64 linker
        • Tool for arm64: race detector, memory sanitizer, …
        • New architecture features
        • ...
  27. Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!
  28. CGo
      [Figure: CGo sits between the Go ABI and the C ABI]

          package print

          // #include <stdio.h>
          // #include <stdlib.h>
          import "C"
          import "unsafe"

          func Print(s string) {
              cs := C.CString(s)              // copy the Go string into C memory
              C.fputs(cs, (*C.FILE)(C.stdout))
              C.free(unsafe.Pointer(cs))      // C memory is not garbage collected
          }
  29. Branch Difference from GNU Assembly
      • On arm64: B is alias for JMP, BL is alias for CALL
      • Jump to labels

              JMP L1
              NOP
          L1: NOP
          L2: NOP
              NOP
              B L2

      • Call and indirect jump

              BL $p.foo
              MOV $p·foo, R3
              CALL (R3)
              B (R3)
              MOV 0(R26), R4
              JMP (R4)

      • Jump relative to PC

              JMP 2(PC)
              NOP
              NOP
              NOP
              NOP
              JMP -2(PC)

      [Callout on the slide: Useful in macros!]
