© 2017 Arm Limited
SFO17-314: Optimizing Golang for High Performance with ARM64 Assembly
Wei Xiao
Staff Software Engineer
Wei.Xiao@arm.com
September 27, 2017
Linaro Connect SFO17
Agenda
• Introduction
• Differences from GNU Assembly
• Integrate assembly into Golang
• Optimize CRC32 for arm64
• Optimize SHA256 for arm64
• Optimize IndexByte for arm64
• Work Summary and Next steps
Introduction
• Assembly optimization benefits
• Take advantage of ARMv8 capabilities
– Hardware-specific instructions (such as SVC, AES, SHA, etc.)
– Vector (Single Instruction Multiple Data) Instructions
• Others
– No need for CGo dependency
– Avoid runtime context switching overhead
– Optimized code (vs Go compiler)
– Faster compilation
Assembly Optimization Current Status
• Go Standard packages with assembly optimization
crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5
crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512
hash/crc32 math math/big reflect
runtime runtime/cgo runtime/internal/atomic runtime/internal/sys
strings sync/atomic syscall ……
(Color key from the original slide: red – arm64 optimization ongoing; black – no arm64 optimization yet.)
Assembly Terminology
• Mnemonic
• CALL, MOVW, MOVD, …
• Register
• R1, F0, V3, …
• Immediate
• $1, $0x100, …
• Memory
• (R1), 8(R3), …
Registers in AArch64
Instruction Differences from GNU Assembly
• Semi-abstract instruction set (Plan 9 from Bell Labs)
• Architecture independent mnemonics like MOVD
• Some architecture aspects shine through
• Assembler may insert prologues, remove ‘unreachable’ instructions
• Instructions may be expanded by the assembler
• Not all instructions available
• BYTE/WORD/LONG directives to lay down opcodes into the instruction stream directly
// func Add(a, b int) int
TEXT ·Add(SB),$0-24
	MOVD	arg1+0(FP), R0
	MOVD	arg2+8(FP), R1
	ADD	R1, R0, R0
	MOVD	R0, ret+16(FP)
	RET
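For context, a minimal sketch of the Go file that pairs with this assembly (file and package names are illustrative):

// add_decl.go (hypothetical): the Go side only declares the prototype;
// the body is the TEXT block above, placed in add_arm64.s.
package mypackage

// Add is implemented in assembly.
func Add(a, b int) int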
Operand Differences from GNU Assembly
• Data flow from left to right
• ADD R1, R2 → R2 += R1
• SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
• Memory operands: base + offset
• MOVH (R1), R2 → R2 = *R1
• MOVBU 8(R3), R4 → R4 = *(8 + R3)
• MOVD mypackage·myvar(SB), R8 → R8 = *myvar
• Addresses
• MOVD $8(R1), R3 → R3 = R1 + 8
• MOVD $·myvar(SB), R4 → R4 = &myvar
(Declarations referenced above: package mypackage; var myvar int64. The middle dot ‘·’ in symbol names is Unicode U+00B7.)
Go Assembly Extension for arm64
• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
• Full details
• https://go-review.googlesource.com/c/go/+/41654
Assembly Build Rule
• The toolchain selects the appropriate assembly files according to GOOS and GOARCH
• Selection is by file-name suffix, e.g.
• sys_linux_arm64.s
• sys_darwin_arm64.s
• Example: assembly files for hash/crc32
• crc32_amd64p32.s
• crc32_amd64.s
• crc32_arm64.s
• crc32_ppc64le.s crc32_table_ppc64le.s
• crc32_s390x.s
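As a sketch of how a portable fallback coexists with the suffix-selected arm64 files — mirroring the way hash/crc32 is laid out, though the package and file names here are hypothetical:

// checksum_noasm.go (hypothetical): pure-Go fallback. On arm64 the
// suffix-named files checksum_arm64.go / checksum_arm64.s are selected
// automatically; a build constraint keeps this fallback off arm64.

// +build !arm64

package checksum

func update(crc uint32, tab *[256]uint32, p []byte) uint32 {
	crc = ^crc
	for _, v := range p {
		crc = tab[byte(crc)^v] ^ (crc >> 8)
	}
	return ^crc
}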
Prototype
• Function call is the bridge between Go and assembly
• Function declaration
• src/runtime/timestub.go
• func walltime() (sec int64, nsec int32)
• Function assembly implementation
• runtime/sys_linux_arm64.s
• Anatomy of the TEXT directive: TEXT package·name(SB), flag, $framesize-argsize
– package (optional)
– middle dot (·)
– function name
– flag (optional)
– stack frame size
– arguments size (optional)
Pseudo-registers
• FP: Frame Pointer
• Points to the bottom of the argument list
• Offsets are positive
• Offsets must include a name, e.g. arg+0(FP)
• SP: Stack Pointer
• Points to the top of the space allocated for local variables
• Offsets are negative
• Offsets must include a name, e.g. ptr-8(SP)
• SB: Static Base
• Named offsets from a global base
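A minimal sketch (hypothetical function, with a companion Go declaration func sum(a, b int64) int64) showing named offsets from both pseudo-registers:

#include "textflag.h"

// One 8-byte local is used purely to illustrate an SP offset.
TEXT ·sum(SB),NOSPLIT,$8-24
	MOVD	a+0(FP), R0     // arguments: positive, named offsets from FP
	MOVD	b+8(FP), R1
	ADD	R1, R0, R0
	MOVD	R0, tmp-8(SP)   // local: negative, named offset from SP
	MOVD	tmp-8(SP), R0
	MOVD	R0, ret+16(FP)  // return value follows the inputs
	RET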
Calling Convention
• All arguments are passed on the stack
• Offsets from FP
• Return arguments follow input arguments
• Start of return arguments aligned to pointer size
• All registers are caller saved, except:
• Stack pointer register (RSP)
• G context pointer register (R28)
• Frame pointer (R29)
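A worked sketch (hypothetical signature func Mix(a uint32, b uint64) uint64) of how argument offsets and the argument-area size fall out of these rules:

#include "textflag.h"

// Layout on arm64 (pointer size 8):
//   a   at a+0(FP)    (4 bytes; b is padded up to the next 8-byte slot)
//   b   at b+8(FP)
//   ret at ret+16(FP) (return values start pointer-aligned after the inputs)
// Argument area: 24 bytes, hence $0-24.
TEXT ·Mix(SB),NOSPLIT,$0-24
	MOVWU	a+0(FP), R0
	MOVD	b+8(FP), R1
	EOR	R1, R0, R0      // any computation; XOR just for illustration
	MOVD	R0, ret+16(FP)
	RET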
arm64 Stack Frame
• Diagrams: frame layouts without and with a frame pointer (frames run from high to low addresses)
Optimize CRC32 for arm64 – Before
• Pure Go table-driven implementation
src/hash/crc32/crc32_generic.go
42 func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
43 crc = ^crc
44 for _, v := range p {
45 crc = tab[byte(crc)^v] ^ (crc >> 8)
46 }
47 return ^crc
48 }
Optimize CRC32 for arm64 – After
• Assembly for arm64
src/hash/crc32/crc32_arm64.s
9 // func castagnoliUpdate(crc uint32, p []byte) uint32
10 TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
11 MOVWU crc+0(FP), R9 // CRC value
12 MOVD p+8(FP), R13 // data pointer
13 MOVD p_len+16(FP), R11 // len(p)
14
15 CMP $8, R11
16 BLT less_than_8
17
18 update:
19 MOVD.P 8(R13), R10
20 CRC32CX R10, R9
21 SUB $8, R11
22
23 CMP $8, R11
24 BLT less_than_8
25
26 JMP update
…
46 done:
47 MOVWU R9, ret+32(FP)
48 RET
Argument layout (offsets from FP):
crc     0(FP)
p.base  8(FP)
p.len   16(FP)
p.cap   24(FP)
ret     32(FP)
Optimize CRC32 for arm64 – Result
• Optimization with assembly
• 2X-7X speedup
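A hedged sketch of how such a speedup can be measured with the standard benchmark harness (package name, file name, and buffer size are arbitrary):

// crc32_bench_test.go (hypothetical _test.go file)
package crc32bench

import (
	"hash/crc32"
	"testing"
)

var buf = make([]byte, 32<<10) // 32 KB; contents don't matter for timing

func BenchmarkCastagnoli32KB(b *testing.B) {
	tab := crc32.MakeTable(crc32.Castagnoli)
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		crc32.Update(0, tab, buf)
	}
}

Running go test -bench=Castagnoli before and after the assembly lands gives the kind of ratio shown above.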
Optimize SHA256 for arm64
• SHA256 introduction
SHA-256 parameters: block size 512 bits, rounds 64, round constants (K) 32 bits each, word size 32 bits, hash size 256 bits
Optimize SHA256 for arm64 – Message schedule
src/crypto/sha256/sha256block.go
84 for i := 0; i < 16; i++ {
85 j := i * 4
86 w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
87 }
88 for i := 16; i < 64; i++ {
89 v1 := w[i-2]
90 t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
91 v2 := w[i-15]
92 t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
93 w[i] = t1 + w[i-7] + t2 + w[i-16]
94 }
for i := 16; i < 64; i+=4 {
SHA256SU0 Vn.S4, Vd.S4
SHA256SU1 Vm.S4, Vn.S4, Vd.S4
}
Optimize SHA256 for arm64 – Hash Computation
src/crypto/sha256/sha256block.go
98 for i := 0; i < 64; i++ {
99 t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]
100
101 t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))
102
103 h = g
104 g = f
105 f = e
106 e = d + t1
107 d = c
108 c = b
109 b = a
110 a = t1 + t2
111 }
for i := 0; i < 64; i+=4 {
SHA256H Vm, Vn, Vd.4S
SHA256H2 Vm, Vn, Vd.4S
}
Optimize SHA256 for arm64 – Implementation
src/crypto/sha256/sha256block_arm64.s
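Callers need no changes; the arm64 block function is picked up transparently. A small usage sketch:

package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	sum := sha256.Sum256([]byte("hello, arm64"))
	fmt.Printf("%x\n", sum)
}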
Optimize SHA256 for arm64 – Result
• Optimization with assembly
• 2X-16X speedup
Optimize IndexByte for arm64 – Before
• Byte-at-a-time scan in src/runtime/asm_arm64.s
• Diagram: R0 steps through the buffer (“H E L L O W O R L D …”), comparing each byte against the target byte (‘D’) held in R2, with R1 bounding the scan
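In Go terms, the pre-SIMD routine behaves like this scan (a behavioral sketch, not the actual assembly):

// indexByte: behavioral equivalent of the byte-at-a-time loop.
func indexByte(b []byte, c byte) int {
	for i := 0; i < len(b); i++ {
		if b[i] == c {
			return i
		}
	}
	return -1
}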
Optimize IndexByte for arm64 – After
• Assembly implementation with SIMD
• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
Compare 16 bytes in parallel
More details:
• Input slice shorter than 16
• Input slice address not 16-byte aligned
• Input slice size not 16-byte aligned
• Count trailing zeros (not leading zeros)
• Implementation:
• https://go-review.googlesource.com/c/go/+/41654
Optimize IndexByte for arm64 – Result
• Optimization with SIMD
• 1.5X-8X speedup
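A hedged benchmark sketch for this path via the bytes package (package name and sizes arbitrary; the target byte is placed last so the whole buffer is scanned):

// indexbyte_bench_test.go (hypothetical _test.go file)
package indexbench

import (
	"bytes"
	"testing"
)

var data = func() []byte {
	b := make([]byte, 4<<10)
	b[len(b)-1] = 0xff
	return b
}()

func BenchmarkIndexByte4KB(b *testing.B) {
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		if bytes.IndexByte(data, 0xff) < 0 {
			b.Fatal("byte not found")
		}
	}
}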
Work Summary
Disassembler (arm64):
https://go-review.googlesource.com/c/arch/+/43651
https://go-review.googlesource.com/c/arch/+/56810
https://go-review.googlesource.com/c/go/+/58930
https://go-review.googlesource.com/c/go/+/56331
https://go-review.googlesource.com/c/go/+/49530
Assembler (arm64):
https://go-review.googlesource.com/c/go/+/33594
https://go-review.googlesource.com/c/go/+/33595
https://go-review.googlesource.com/c/go/+/41511
https://go-review.googlesource.com/c/go/+/41654
https://go-review.googlesource.com/c/go/+/45850
https://go-review.googlesource.com/c/go/+/54951
https://go-review.googlesource.com/c/go/+/54990
https://go-review.googlesource.com/c/go/+/57852
https://go-review.googlesource.com/c/go/+/58350
https://go-review.googlesource.com/c/go/+/56030
https://go-review.googlesource.com/c/go/+/46438
https://go-review.googlesource.com/c/go/+/41653
Optimizations:
https://go-review.googlesource.com/c/go/+/40074
https://go-review.googlesource.com/c/go/+/61550
https://go-review.googlesource.com/c/go/+/61570
https://go-review.googlesource.com/c/go/+/33597
https://go-review.googlesource.com/c/go/+/64490
https://go-review.googlesource.com/c/go/+/55610
Others:
https://go-review.googlesource.com/c/go/+/61511
https://go-review.googlesource.com/c/go/+/62850
https://go-review.googlesource.com/c/go/+/45112
https://go-review.googlesource.com/c/go/+/44390
https://go-review.googlesource.com/c/go/+/42971
https://go-review.googlesource.com/c/go/+/40511
https://go-review.googlesource.com/c/arch/+/37172
Next Steps
• Crypto optimizations:
• aes, elliptic, …
• SIMD optimizations:
• strings, bytes, runtime, reflect, …
• Compiler SSA arm64 back-end optimizations
• Others
• Internal arm64 linker
• Tooling for arm64: race detector, memory sanitizer, …
• New architecture features
• ...
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
CGo
• CGo bridges the Go ABI and the C ABI

package print

// #include <stdio.h>
// #include <stdlib.h>
import "C"
import "unsafe"

func Print(s string) {
	cs := C.CString(s)
	C.fputs(cs, (*C.FILE)(C.stdout))
	C.free(unsafe.Pointer(cs))
}
Branch Difference from GNU Assembly
• On arm64: B is an alias for JMP, BL is an alias for CALL
• Jump to labels
	JMP	L1
	NOP
L1:
	NOP
L2:	NOP
	NOP
	B	L2
• Call and indirect jump
	BL	$p·foo
	MOV	$p·foo, R3
	CALL	(R3)
	B	(R3)
	MOV	0(R26), R4
	JMP	(R4)
• Jump relative to PC (useful in macros!)
	JMP	2(PC)
	NOP
	NOP
	NOP
	NOP
	JMP	-2(PC)
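A small sketch of the “useful in macros” point (an assumed fragment of an arm64 .s file; the macro name is made up): PC-relative branches need no labels, so a macro can be expanded many times in one file without label clashes.

#define ZERO_IF_NEG(Rd) \
	CMP	$0, Rd \
	BGE	2(PC)  \
	MOVD	ZR, Rd

// Usage inside a TEXT block:
//	ZERO_IF_NEG(R5)   // clamps R5 to zero when it is negative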