So You Wanna Go Fast?
Strange Loop 2017 @tyler_treat

Make your code faster with this one weird trick

So You Wanna Subvert Go?

Spoiler Alert: Go is not a systems language…
but that doesn’t mean you can’t build internet-scale systems with it.

This is a talk about how to write terrible Go code.
Because this is a talk about trade-offs.

Tyler Treat
- Messaging Nerd @ Apcera
- Working on nats.io
- Distributed systems
- bravenewgeek.com

Why does this talk matter?

The compiler isn’t magic.
You have to be mindful of performance when it matters.

[Figure: “Where bad things hide” vs. “Where we’re usually looking”]

Tire fires at scale

Overview
- Measuring performance
- Language features
- Memory management
- Concurrency and multi-core

Disclaimer: Don’t blindly apply optimizations presented.

tl;dr of this talk is “IT DEPENDS!”

Measure
Optimize

Measurement Techniques
- pprof
  - memory
  - cpu
  - blocking
- GODEBUG
  - gctrace
  - schedtrace
  - allocfreetrace
- Benchmarking
  - Code-level: testing.B
  - System-level: HdrHistogram (https://github.com/codahale/hdrhistogram),
    bench (https://github.com/tylertreat/bench)
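As a rough illustration of the code-level option, the sketch below runs a benchmark function through testing.Benchmark, which lets it execute as an ordinary program instead of via `go test`. The function under test (joinWithBuilder) and the benchmark name are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps results alive so the compiler cannot discard the work.
var sink string

// joinWithBuilder is a hypothetical function under test: it concatenates
// parts using strings.Builder.
func joinWithBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// benchmarkJoin has the same shape as a BenchmarkXxx function in a
// _test.go file: run the code under test b.N times.
func benchmarkJoin(b *testing.B) {
	parts := []string{"so", "you", "wanna", "go", "fast"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = joinWithBuilder(parts)
	}
}

func main() {
	// testing.Benchmark picks b.N automatically and returns timing and
	// allocation statistics.
	res := testing.Benchmark(benchmarkJoin)
	fmt.Println(res.String(), res.MemString())
}
```

In a real project the same function would normally live in a _test.go file as BenchmarkJoin and be run with `go test -bench=.`. Adding, for example, `-cpuprofile cpu.out` produces a profile that `go tool pprof` can read, and running a binary with `GODEBUG=gctrace=1` prints a line per garbage collection.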

The only way to get good at something is to be really fucking bad at it for a long time.

Benchmarking… a great way to rattle the Hacker News fart chamber.

Language features

channels

“Instead of explicitly using locks to mediate access to shared data, Go encourages the use of channels to pass references to data between goroutines.”
https://blog.golang.org/share-memory-by-communicating

USE CHANNELS TO COORDINATE, NOT SYNCHRONIZE.
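One possible reading of this advice, sketched under my own assumptions: protect hot shared state with a mutex, and keep channels for signalling between goroutines. The counter type and run function are hypothetical names for this example, not code from the talk.

```go
package main

import (
	"fmt"
	"sync"
)

// counter guards its state with a mutex (synchronization).
type counter struct {
	mu sync.Mutex
	n  int
}

func (c *counter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

func (c *counter) value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// run starts `workers` goroutines that each increment the counter
// `perWorker` times. The channel is used only to signal that a worker
// has finished (coordination), not to protect the counter.
func run(workers, perWorker int) int {
	var c counter
	done := make(chan struct{})
	for w := 0; w < workers; w++ {
		go func() {
			for i := 0; i < perWorker; i++ {
				c.inc()
			}
			done <- struct{}{}
		}()
	}
	for w := 0; w < workers; w++ {
		<-done
	}
	return c.value()
}

func main() {
	fmt.Println(run(8, 1000)) // 8000
}
```

Whether this beats a channel-based design depends on the workload, so it is worth benchmarking both before committing to either.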

defer

Is defer still slow?
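The question is easiest to answer by measuring on the Go version you actually run. The sketch below is a minimal, assumed setup (the function names are mine) that times the same critical section with and without defer. It deliberately states no expected numbers, since the cost of defer has changed across releases.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var (
	mu      sync.Mutex
	counter int
)

// incWithDefer releases the lock through a deferred call.
func incWithDefer() {
	mu.Lock()
	defer mu.Unlock()
	counter++
}

// incWithoutDefer releases the lock explicitly.
func incWithoutDefer() {
	mu.Lock()
	counter++
	mu.Unlock()
}

func main() {
	withDefer := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			incWithDefer()
		}
	})
	withoutDefer := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			incWithoutDefer()
		}
	})
	fmt.Println("defer:   ", withDefer)
	fmt.Println("no defer:", withoutDefer)
}
```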

The Secret Life of interface{}

type Stringer interface {
    String() string
}

type Binary uint64

func (i Binary) String() string {
    return strconv.FormatUint(uint64(i), 2)
}

b := Binary(200)
https://research.swtch.com/interfaces

s := Stringer(b)

[Diagram: the Stringer interface value s holds two pointers.
 tab  -> itable(Stringer, Binary): type = type(Binary), fun[0] = (*Binary).String
 data -> a Binary holding 200]
https://research.swtch.com/interfaces
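To make the interface value concrete, here is a small runnable version of the Stringer/Binary example. The describe helper and the reassignment of b are my additions for illustration. They show that the interface holds its own copy of the value and that the call goes through the interface's method table.

```go
package main

import (
	"fmt"
	"strconv"
)

type Stringer interface {
	String() string
}

type Binary uint64

func (i Binary) String() string {
	return strconv.FormatUint(uint64(i), 2)
}

// describe calls String through the interface, i.e. via dynamic dispatch.
func describe(s Stringer) string {
	return s.String()
}

func main() {
	b := Binary(200)
	s := Stringer(b) // s now holds its own copy of b's value

	b = 7 // changing b afterwards does not affect the copy held in s

	fmt.Println(describe(s)) // 11001000
	fmt.Println(b.String())  // 111

	// A type assertion recovers the concrete value stored in the interface.
	if v, ok := s.(Binary); ok {
		fmt.Println(uint64(v)) // 200
	}
}
```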

So what?

[Charts: Sorting 100M Interfaces; Sorting 100M Structs]

$ go test -bench=. -gcflags="-m"   # -m prints the compiler's optimization decisions (inlining, escape analysis)

$ go test -bench=. -gcflags="-l"   # -l disables inlining

[Chart: Struct (No Inlining) vs. Interface (No Inlining)]

[Chart annotations: "x.(*T) inlined"; "SSA backend & remaining type conversions inlined"]

[Chart: Struct vs. Interface]

$ go test -bench=. -gcflags="-S"   # -S prints the generated assembly

Key Insight: If performance matters, write type-specific code.
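A small sketch of what "type-specific" can mean in practice. This is not the benchmark from the slides: the item type and both sort functions are hypothetical. The first path goes through sort.Interface, and the second is written directly against []item so the comparisons are plain code the compiler can inline.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical element type.
type item struct {
	key int
}

// byKey adapts []item to sort.Interface: every Len, Less and Swap call
// made by sort.Sort goes through an interface method table.
type byKey []item

func (s byKey) Len() int           { return len(s) }
func (s byKey) Less(i, j int) bool { return s[i].key < s[j].key }
func (s byKey) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// sortGeneric sorts through the interface-based standard library API.
func sortGeneric(items []item) { sort.Sort(byKey(items)) }

// sortSpecific is a type-specific insertion sort operating directly on
// []item. Insertion sort is O(n^2); it is used here only to keep the
// sketch short, not as a recommendation for large inputs.
func sortSpecific(items []item) {
	for i := 1; i < len(items); i++ {
		for j := i; j > 0 && items[j].key < items[j-1].key; j-- {
			items[j], items[j-1] = items[j-1], items[j]
		}
	}
}

func main() {
	a := []item{{3}, {1}, {2}}
	b := []item{{3}, {1}, {2}}
	sortGeneric(a)
	sortSpecific(b)
	fmt.Println(a, b) // [{1} {2} {3}] [{1} {2} {3}]
}
```

The trade-off is duplicated code per type in exchange for avoiding dynamic dispatch, so it is only worth paying where measurement shows the sort is hot.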

Memory management

[]byte to string conversions

What’s going on here?

memory allocation
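One way to see the allocation behind a []byte to string conversion is to count heap allocations with testing.AllocsPerRun. This is a minimal sketch under my own assumptions (a 64-byte slice and a package-level sink so the result escapes); the compiler can avoid the copy in some other situations, so the count depends on how the string is used.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the converted string alive so the compiler cannot drop or
// stack-allocate the conversion.
var sink string

// allocsPerConversion reports the average number of heap allocations made
// by one string(b) conversion of a 64-byte slice.
func allocsPerConversion() float64 {
	b := make([]byte, 64)
	return testing.AllocsPerRun(1000, func() {
		sink = string(b) // copies the bytes into a newly allocated string
	})
}

func main() {
	fmt.Println("allocs per string(b):", allocsPerConversion())
}
```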

How is sync.Pool so fast?
Per-CPU storage!
https://golang.org/src/sync/pool.go
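For context, a typical way to use sync.Pool is to recycle scratch buffers on a hot path. The sketch below assumes a hypothetical render function; only sync.Pool and bytes.Buffer come from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values. New is called only
// when the pool has nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render is a hypothetical hot-path function that needs a scratch buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result stays valid after Put
}

func main() {
	fmt.Println(render("gopher")) // hello, gopher
}
```

Pooled objects can be dropped by the runtime at any time, so a pool suits temporary scratch space rather than anything that must persist.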

Concurrency and multi-core

“We generally don’t want sync/atomic to be used at all…Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations…”
- Ian Lance Taylor

Fast Topic Matching
[Diagram: Subscribers and Messages]
http://bravenewgeek.com/fast-topic-matching/

[Chart: Concurrent, 80,000 inserts, 80,000 lookups]

Ctrie
1. Assign a generation, G1, to each I-node (empty struct).
2. Add new node by copying I-node with updated branch and generation then GCAS, i.e. atomically:
   - compare I-nodes to detect tree mutations.
   - compare root generations to detect snapshots.
[Diagram: Ctrie with I-nodes at generations G1 and G2]
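The copy-then-compare-and-swap part of step 2 can be sketched in isolation. This is a deliberately simplified illustration, not a Ctrie and not GCAS: it uses a plain CAS on one pointer and leaves out generations and snapshot detection entirely. The node and inode types are hypothetical, and atomic.Pointer requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// node is an immutable value: once published it is never modified.
type node struct {
	branches []string
}

// inode holds the current node behind an atomic pointer.
type inode struct {
	main atomic.Pointer[node]
}

func newInode() *inode {
	in := &inode{}
	in.main.Store(&node{})
	return in
}

// insert copies the current node with one more branch, then tries to
// publish the copy with compare-and-swap. If another goroutine changed
// the pointer in the meantime, the CAS fails and the loop retries.
func (in *inode) insert(branch string) {
	for {
		old := in.main.Load()
		branches := make([]string, len(old.branches), len(old.branches)+1)
		copy(branches, old.branches)
		next := &node{branches: append(branches, branch)}
		if in.main.CompareAndSwap(old, next) {
			return
		}
	}
}

// insertConcurrently performs n inserts from n goroutines and returns
// the final node.
func insertConcurrently(n int) *node {
	in := newInode()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			in.insert(fmt.Sprintf("branch-%d", i))
		}(i)
	}
	wg.Wait()
	return in.main.Load()
}

func main() {
	fmt.Println(len(insertConcurrently(100).branches)) // 100
}
```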

The Go race detector doesn’t protect you from doing dumb stuff.

Side note: unsafe is, in fact, unsafe.

“Packages that import unsafe may depend on internal properties of the Go implementation. We reserve the right to make changes to the implementation that may break such programs.”
https://golang.org/doc/go1compat

Key Insight: Struct layout can make a big difference.

Mechanical Sympathy
https://github.com/Workiva/go-datastructures/blob/master/queue/ring.go
https://golang.org/src/sync/rwmutex.go

[Diagram sequence: readers and writers running on several CPUs, first all sharing a single RWMutex, then with a separate RWMutex per CPU]

How to create CPU->RWMutex mapping?
https://github.com/jonhoo/drwmutex/blob/master/cpu_amd64.s
/proc/cpuinfo
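The general shape of a per-CPU reader/writer lock can be sketched without the CPU lookup itself. In this illustration the type, its methods and the demo function are all hypothetical, and the caller passes a plain integer hint where a real implementation would determine the current CPU. It is not the drwmutex API.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// shardedRWMutex splits one logical reader/writer lock into several
// sync.RWMutex shards. Readers lock one shard; writers lock all of them.
type shardedRWMutex struct {
	shards []sync.RWMutex
}

func newShardedRWMutex(n int) *shardedRWMutex {
	return &shardedRWMutex{shards: make([]sync.RWMutex, n)}
}

// RLock read-locks the shard selected by hint and returns the matching
// unlock function. hint is a caller-supplied stand-in for "which CPU am
// I running on".
func (m *shardedRWMutex) RLock(hint uint) (unlock func()) {
	s := &m.shards[hint%uint(len(m.shards))]
	s.RLock()
	return s.RUnlock
}

// Lock takes every shard in a fixed order, which excludes all readers
// and all other writers.
func (m *shardedRWMutex) Lock() {
	for i := range m.shards {
		m.shards[i].Lock()
	}
}

// Unlock releases every shard.
func (m *shardedRWMutex) Unlock() {
	for i := range m.shards {
		m.shards[i].Unlock()
	}
}

// demo runs concurrent writers and readers against a shared total and
// returns its final value.
func demo(shards, writers, readers int) int {
	m := newShardedRWMutex(shards)
	total := 0
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Lock()
			total++
			m.Unlock()
		}()
	}
	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			unlock := m.RLock(uint(r))
			_ = total // read under the shard's read lock
			unlock()
		}(r)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(demo(runtime.NumCPU(), 100, 100)) // 100
}
```

Reads get cheaper because readers on different shards no longer touch the same lock word, while writes get more expensive because they must take every shard. Adjacent sync.RWMutex values in the slice can still share a cache line.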

[Diagram: RWMutex1, RWMutex2, RWMutex3, … RWMutexN laid out contiguously in memory, 24 bytes each, against a 64-byte cache line]
https://github.com/jonhoo/drwmutex/blob/master/drwmutex.go
[Diagram: RWMutex1 (24 bytes) followed by padding, filling a 64-byte cache line]
Cache rules everything around me
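The padding idea can be written down as a struct whose size is rounded up to one cache line. This is my own minimal sketch, not the layout used in drwmutex.go. The 64-byte line size is taken from the slide and is an assumption about the target hardware.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// cacheLineSize is an assumption: 64 bytes, as on the slide.
const cacheLineSize = 64

// paddedRWMutex pads a sync.RWMutex out to a full cache line so that
// neighbouring elements of a slice or array do not share a line.
type paddedRWMutex struct {
	mu sync.RWMutex
	_  [cacheLineSize - unsafe.Sizeof(sync.RWMutex{})]byte
}

// sizes reports the size in bytes of a bare and a padded RWMutex.
func sizes() (bare, padded uintptr) {
	return unsafe.Sizeof(sync.RWMutex{}), unsafe.Sizeof(paddedRWMutex{})
}

func main() {
	bare, padded := sizes()
	fmt.Println("sync.RWMutex:", bare, "bytes; paddedRWMutex:", padded, "bytes")

	// One lock per slot; consecutive slots are now 64 bytes apart.
	locks := make([]paddedRWMutex, 4)
	locks[0].mu.RLock()
	locks[0].mu.RUnlock()
}
```

Padding fixes the spacing between locks but does not by itself guarantee that the first element starts on a cache-line boundary, and it spends extra memory per lock.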

Go makes concurrency easy enough to be dangerous.

Conclusions
1. The standard library provides general solutions (and they’re generally what you should use).
2. Seemingly small, idiomatic decisions can have profound performance implications.
3. The Go toolchain has lots of tools for analyzing your code: learn them.
4. Go’s compiler and runtime continue to improve.
5. Performance profile can change dramatically between releases.
6. Relying on assumptions can be fatal.
7. Code is marginal, architecture is material.
8. Peeking behind the curtains can pay dividends.
9. Above all, optimize for the right trade-off.

Thanks!

So You Wanna Go Fast?