3. If you don’t care about
Performance
You are in the wrong talk.
@postwait should throw you out.
4. Perhaps some justification is warranted
Performance…
makes a better user experience
increases loyalty
reduces product abandonment
increases speed of product development
lowers total cost of ownership
builds more cohesive teams
6. It’s all about latency…
Throughput vs. Latency
Lower latency often
affords increased throughput.
Latency is the focus.
https://www.flickr.com/photos/poeloq/3140100971
7. Generally, time should be measured in seconds.
UX latency should be in milliseconds.
Time
Users can’t observe microseconds.
Users quit over seconds.
User experience is measured in milliseconds.
(with at least microsecond precision)
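A minimal sketch of that convention, assuming Python: keep durations in seconds internally, read them from a high-resolution monotonic clock, and convert to milliseconds only when presenting UX latency. The helper names `timed` and `as_millis` are hypothetical and not from the talk.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) from a monotonic clock."""
    start_ns = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    # Stored in seconds; the nanosecond counter preserves sub-microsecond detail.
    elapsed_s = (time.perf_counter_ns() - start_ns) / 1e9
    return result, elapsed_s


def as_millis(seconds):
    """Format a duration in seconds as milliseconds with microsecond precision."""
    return f"{seconds * 1e3:.3f} ms"


if __name__ == "__main__":
    _, elapsed = timed(sum, range(100_000))
    print(as_millis(elapsed))
```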
8. Music is all about the space between the notes.
Connectedness
Performance is about how quickly you can
complete some work.
In a connected service architecture,
performance is also about the time spent
between the service layers.
11. Report on and celebrate
Large Collective Wins
https://www.flickr.com/photos/tomer_a/1130647512
12. Transcendent Tooling
Tooling must transcend the team
and keep consistent conversation
https://www.flickr.com/photos/meanestindian/2260343214
13. Large-Scale Distributed Systems Tracing Infrastructure
Dapper
Google published a paper:
research.google.com/pubs/pub36356.html
As usual, code never saw the outside.
14. Large-Scale Distributed Systems Tracing Infrastructure
web api
data agg
mq
db
data store
cep
alerting
15. The Basics
❖ Focused on User Interactions (not requests)
❖ Each new request is assigned a “Trace ID”
❖ The service records start/stop/etc. against
a “Span ID” (first Span ID == Trace ID)
❖ In the context of a “Span ID”,
each remote call gets a new Span ID,
with the Parent Span ID set to the context.
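The ID rules above can be sketched as a small data model. This is an illustration under assumptions, not Dapper's or Zipkin's actual code: the `Span` class, `new_id`, `root_span`, and `child_span` are hypothetical names, and the 64-bit random hex ID is one plausible choice of identifier.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Return a random 64-bit identifier as hex (an assumed ID format)."""
    return secrets.token_hex(8)


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    annotations: list = field(default_factory=list)

    def note(self, value: str) -> None:
        """Record an event such as start or stop against this span."""
        self.annotations.append((time.time(), value))


def root_span() -> Span:
    """A new request gets a Trace ID; the first Span ID equals the Trace ID."""
    trace_id = new_id()
    return Span(trace_id=trace_id, span_id=trace_id)


def child_span(context: Span) -> Span:
    """A remote call gets a new Span ID whose parent is the current context."""
    return Span(
        trace_id=context.trace_id,
        span_id=new_id(),
        parent_span_id=context.span_id,
    )
```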
16. Example
Web Request: /do/magic
(no X-B3-TraceId header)
Creates TraceId T1, SpanId T1
Notes “sr” (server receive)
needs to talk to service MS
Creates new SpanId T2
Notes “cs” (client send)
Request to MS
Notes “cr” (client receive)
Notes “ss” (server send)
Sends response
Async publish span(s)
GET /pixie/dust
X-B3-TraceId: T1
X-B3-ParentSpanId: T1
X-B3-SpanId: T2
Extracts headers
Notes “sr” (server receive)
performs actions
Notes “ss” (server send)
Responds
Async publish span(s)
Scribe
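The header handling in this example can be sketched as two functions, one for each side of the call. The header names are the ones shown on the slide; everything else (`server_receive`, `client_send_headers`, the dict-based context, the random hex IDs) is a hypothetical simplification. Real HTTP header lookup is case-insensitive, which a plain dict is not.

```python
import secrets


def new_id():
    """Return a random 64-bit identifier as hex (an assumed ID format)."""
    return secrets.token_hex(8)


def server_receive(headers):
    """Build the span context for an incoming request ("sr")."""
    trace_id = headers.get("X-B3-TraceId")
    if trace_id is None:
        # No trace header: start a new trace whose first span shares its ID.
        trace_id = new_id()
        return {"trace_id": trace_id, "span_id": trace_id, "parent_span_id": None}
    # Trace header present: adopt the IDs chosen by the caller.
    return {
        "trace_id": trace_id,
        "span_id": headers["X-B3-SpanId"],
        "parent_span_id": headers.get("X-B3-ParentSpanId"),
    }


def client_send_headers(context):
    """Build the headers for an outgoing remote call ("cs")."""
    return {
        "X-B3-TraceId": context["trace_id"],
        "X-B3-ParentSpanId": context["span_id"],
        "X-B3-SpanId": new_id(),
    }


if __name__ == "__main__":
    web = server_receive({})                 # /do/magic, no trace header
    outgoing = client_send_headers(web)      # call to service MS
    ms = server_receive(outgoing)            # MS extracts the headers
    print(web, outgoing, ms, sep="\n")
```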
20. A pseudo-Dapper
Zipkin
Twitter sought to (re)implement Dapper.
Disappointingly few improvements.
Some unfortunate UX issues.
Sound. Simple. Valuable.
21. Thrift and Scribe should both die.
Scribe is Terrible
Terrible. Terrible. Terrible.
Thrift is terrible.
Scribe is “strings” in Thrift.
Performance-focused people don’t use strings.
23. The whole point is to be low overhead
Screw Scribe
We push raw Thrift over Fq
github.com/circonus-labs/fq
Completely async publishing,
lock free if using the C library.
Consolidating Zipkin’s bad decisions:
github.com/circonus-labs/fq2scribe
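The idea of keeping span publishing off the request path can be sketched generically. This is not the Fq client API, and it is not lock free (Python's `queue.Queue` uses locks internally); it only illustrates the shape: the request thread enqueues without blocking, drops the span if the buffer is full, and a background thread does the sending. `AsyncPublisher` and its `send` callback are hypothetical.

```python
import queue
import threading


class AsyncPublisher:
    """Hand spans to a background thread so requests never wait on I/O."""

    def __init__(self, send, max_pending=1024):
        self._send = send                      # callable that ships one payload
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0                       # spans discarded under backpressure
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, payload):
        """Enqueue without blocking; drop the span rather than stall the caller."""
        try:
            self._queue.put_nowait(payload)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            payload = self._queue.get()
            if payload is None:                # sentinel: shut down
                return
            self._send(payload)

    def close(self):
        """Flush everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

Dropping on a full queue is a deliberate trade: losing a trace is cheaper than adding latency to the request being traced.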
24. Telling computers what to do.
Zipkin is Java/Scala
Wrote C support:
github.com/circonus-labs/libmtev
Wrote Perl support:
github.com/circonus-labs/circonus-tracer-perl
31. Celebration
Day 4-7
Noticed frequent 150ms stalls in internal REST.
Often: 90%+
Found a libcurl issue (async resolver).
Shaved 150ms*(n*0.9) off ~50% of page loads.
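The arithmetic on this slide can be worked through, assuming `n` is the number of internal REST calls a page load makes and 0.9 is the fraction of those calls that hit the stall. Both readings are assumptions; the slide does not define `n`, and the function name below is hypothetical.

```python
def expected_savings_ms(calls_per_page, stall_ms=150.0, stall_rate=0.9):
    """Expected time removed from one page load: 150ms * (n * 0.9)."""
    return stall_ms * (calls_per_page * stall_rate)


if __name__ == "__main__":
    # Illustrative values of n only; the talk does not give a number.
    for n in (1, 2, 4):
        print(n, "calls:", round(expected_savings_ms(n), 1), "ms saved")
```

Under these assumptions a page with a single internal call saves about 135 ms on average, and the saving grows linearly with the number of calls.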
32. You can do all of this at work.
Go To Work
And have a deeply technical
cross-team conversation
about performance