Go runtime-related problems in TiDB in production
The Dark Side Of Go
Ed Huang, CTO, PingCAP
What's TiDB?
Open source, distributed SQL database
written in Go (and Rust).
pingcap/tidb
● Latency in scheduler
■ case study: latency jitter under high load due to fair
scheduling policy
● Memory control
■ case study: Abnormal memory usage due to
Transparent Huge Pages
● GC Related
■ case study: GC sweep caused latency jitter
■ case study: Lock and NUMA aware
Agenda
Part I - Latency in scheduler
● The client consists of a goroutine and a channel
○ The channel batches the requests
○ A goroutine runs the read-send-recv loop
○ A typical MPSC pattern (see the sketch below)
Background
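A minimal sketch of this batching pattern (not TiDB's actual client code; types such as batchClient and request are made up for illustration): many caller goroutines push requests into one channel, and a single goroutine drains them in batches and runs the read-send-recv loop.

package main

import "fmt"

// Hypothetical types for illustration only.
type request struct {
	payload []byte
	done    chan []byte // the caller blocks on this until the reply arrives
}

type batchClient struct {
	reqCh chan *request
}

func newBatchClient() *batchClient {
	c := &batchClient{reqCh: make(chan *request, 1024)}
	go c.loop() // the single consumer running the read-send-recv loop
	return c
}

func (c *batchClient) loop() {
	for {
		// Block for the first request, then drain whatever is already queued.
		batch := []*request{<-c.reqCh}
	drain:
		for len(batch) < 128 {
			select {
			case r := <-c.reqCh:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		// Send the batch over the network and dispatch the replies (omitted);
		// here we just echo the payloads back to unblock the callers.
		for _, r := range batch {
			r.done <- r.payload
		}
	}
}

// Send is called concurrently from many goroutines (the multi-producer side).
func (c *batchClient) Send(payload []byte) []byte {
	r := &request{payload: payload, done: make(chan []byte, 1)}
	c.reqCh <- r
	return <-r.done
}

func main() {
	c := newBatchClient()
	fmt.Printf("%s\n", c.Send([]byte("ping")))
}

The important property for this case study is that every caller blocks on the single consumer goroutine, so any scheduling delay of that goroutine is visible to all of them.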
Description
● The machine load and CPU usage are high, but the network IO pressure is reasonable
● Yet the RPC metrics show high latency
[Trace view: from network ready to goroutine wakeup = 4.368ms]
● From network IO ready to goroutine wakeup: 4.368ms
○ Sometimes even 10ms+ of latency here!
○ The time spent in runtime scheduling is not negligible
● When the CPU is overloaded, which goroutine should be given priority?
Analysis
● This goroutine is special: it blocks all of its callers
● But the scheduler treats all goroutines equally
Analysis
● Under heavy workload, goroutines take longer to get scheduled
● The runtime scheduler does not support priorities
● A CPU-intensive workload can therefore hurt IO latency
Conclusion
Part II - Memory control
● Go Runtime
○ Allocated from the OS (mmapped)
○ Managed Memory
■ Should the memory be returned to the OS?
○ Inuse Memory
■ How much memory is *really* used?
○ GC
■ After each GC cycle, the unused objects are released
○ Idle Memory
■ The unused objects are returned to Managed Memory, not to the OS
Background: What are we talking about when we
talk about Memory
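These categories map roughly onto runtime.MemStats; a small generic sketch (not TiDB code) that prints them:

package main

import (
	"fmt"
	"runtime"
)

// HeapSys: taken from the OS; HeapInuse: really used; HeapIdle: unused but
// still held by the runtime; HeapReleased: the part of HeapIdle that has
// been given back to the OS.
func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapSys:      %d MiB (allocated from the OS)\n", m.HeapSys>>20)
	fmt.Printf("HeapInuse:    %d MiB (really in use)\n", m.HeapInuse>>20)
	fmt.Printf("HeapIdle:     %d MiB (unused, still managed)\n", m.HeapIdle>>20)
	fmt.Printf("HeapReleased: %d MiB (returned to the OS)\n", m.HeapReleased>>20)
}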
Background: What are we talking about when we
talk about Memory
● TLB (translation lookaside buffer)
○ A CPU cache for virtual-to-physical address mappings
○ Limited in size; on a cache miss, memory access becomes slower
● Default page size is 4K
● THP increases the page size to 2M or more
○ Larger pages mean fewer page entries per process
○ Fewer page entries mean a better TLB hit rate
○ The result is better memory access performance
Background: THP (transparent huge pages)
Case study: THP (transparent huge pages)
TiDB memory usage is abnormal and does not match the reported heap usage.
Description
● The TiDB server's memory footprint is abnormal
● Not much memory is left available on this node
Description
● The Go runtime thinks it is not using much memory
● Yet the memory is not released to the OS (RSS stays high)
Investigate
[Heap graph: memory allocated from the OS vs. memory reserved by the Go runtime]
Check /proc/{PID}/smaps to see the memory mappings of the TiDB server process (a sketch of this check follows below)
AnonHugePages 1980416 kB
The memory is occupied by THP
Investigate
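The same check can be scripted; a small helper sketch (not part of TiDB) that sums the AnonHugePages entries in smaps:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Sum the "AnonHugePages:" entries in /proc/<pid>/smaps to see how much of
// the process's memory is backed by transparent huge pages.
func anonHugePagesKB(pid string) (int64, error) {
	f, err := os.Open("/proc/" + pid + "/smaps")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var totalKB int64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like: "AnonHugePages:      2048 kB"
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "AnonHugePages:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			totalKB += kb
		}
	}
	return totalKB, sc.Err()
}

func main() {
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	kb, err := anonHugePagesKB(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("AnonHugePages: %d kB\n", kb)
}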
Compare with other nodes in the cluster with THP disabled
Investigate
● So, the root cause must be related to THP (transparent huge pages)
● But … why?
Analysis
● The Go runtime manages memory at 8K granularity
● The runtime gives the OS hints (madvise) about how each memory block is used
● With THP enabled, the memory block address is rounded up to the huge page boundary
● Fragmentation!
● The fragments make the memory hard for the OS to reclaim
Analysis
● And even more confusing behavior from the OS: merging pages back into huge pages
○ The user program returns 4K pages to the OS
○ khugepaged can turn a single 4K page into a 2M huge page
○ bloating the process's RSS by as much as 512x
● Some clues from the Go source code
○ https://github.com/golang/go/blob/release-branch.go1.11/src
/runtime/mem_linux.go#L38
Analysis
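For illustration, here is roughly what such a hint looks like at the syscall level, using golang.org/x/sys/unix. This is a standalone sketch, not the runtime's code linked above; mem_linux.go issues the madvise calls itself.

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Map 4 MiB of anonymous memory, roughly like the runtime taking a chunk
	// from the OS.
	mem, err := unix.Mmap(-1, 0, 4<<20,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANON)
	if err != nil {
		log.Fatal(err)
	}

	// Hint: do not back this range with transparent huge pages.
	if err := unix.Madvise(mem, unix.MADV_NOHUGEPAGE); err != nil {
		log.Fatal(err)
	}

	// Hint: this small 8K piece is unused and the kernel may reclaim it.
	// With THP the kernel thinks in 2 MiB units, so many small, scattered
	// releases like this are what leads to the fragmentation described above.
	if err := unix.Madvise(mem[:8<<10], unix.MADV_DONTNEED); err != nil {
		log.Fatal(err)
	}

	fmt.Println("madvise hints applied")
}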
Follow up
● TiDB now recommends that users disable THP (a sketch of such a check is below)
○ Just like MySQL / Oracle / Redis / … did
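A sketch of a startup check of this kind (not TiDB's actual code): read the THP mode and warn if it is not "never".

package main

import (
	"fmt"
	"os"
	"strings"
)

// The sysfs file contains something like "always madvise [never]",
// with the active mode in brackets.
func main() {
	data, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/enabled")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read THP setting:", err)
		return
	}
	if !strings.Contains(string(data), "[never]") {
		fmt.Println("warning: THP is enabled; consider",
			"echo never > /sys/kernel/mm/transparent_hugepage/enabled")
	}
}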
Part III - GC Related
A brief introduction about GC
What happens when the collector cannot keep up with the allocation rate?
● How does garbage collection interrupt the user program during memory allocation?
○ 25% of CPU is dedicated to GC
○ MARK ASSIST
○ …
A brief introduction about GC
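A quick way to see the collector's CPU budget from inside a program (a generic sketch, not TiDB code):

package main

import (
	"fmt"
	"runtime"
)

// GCCPUFraction reports the fraction of available CPU spent in GC since the
// program started. When the allocation rate outruns the collector, allocating
// goroutines are drafted into mark assist on top of this budget.
var sink []byte

func main() {
	for i := 0; i < 1_000_000; i++ {
		sink = make([]byte, 1024) // escapes to the heap; the previous one becomes garbage
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("GC cycles: %d, CPU spent in GC: %.2f%%\n", m.NumGC, m.GCCPUFraction*100)
}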
Case study: GC Sweep caused latency jitter
Weird latency for simple requests (primary key updates)
Latency jitter from 30ms to 800ms
CPU, network, and disk load are all normal
● The query duration is unstable, varying widely from 30ms to 800ms
● No problems with IO, CPU, or network
Description
So, where does the latency come from?
Investigation
curl http://0.0.0.0:10080/debug/pprof/trace?seconds=60
Get the trace information
Observation:
The query is blocked by GC sweep
Investigation
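The /debug/pprof/trace endpoint used above comes from net/http/pprof; a minimal sketch of exposing it (TiDB serves it on its status port, matching the curl command above):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/*, including /debug/pprof/trace
)

func main() {
	// Serve the pprof endpoints on the status port.
	log.Fatal(http.ListenAndServe("0.0.0.0:10080", nil))
}

The downloaded trace can then be inspected with go tool trace.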
● The garbage collector can stop individual goroutines, even without a STW pause
○ e.g. a goroutine that needs to allocate memory right after a GC cycle may be required to help with the sweep work
● But why is it … so long?
○ Let's go down the rabbit hole
Analysis
Analysis
● Wait, scan 1G of memory?
● … just to release 8K?
● So that's why it takes so long!
Analysis
And we found a similar upstream issue, "latency in sweep assists when there's no garbage", along with its fix
● Verified with the fix: the latency jitter is gone
○ yeah, that's the problem
Analysis
● One more thing … how does TiDB trigger that bug? (toy reproduction below)
○ The statistics module uses a lot of long-lived 8K objects
○ All of these objects are still alive after GC
Analysis
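A toy reproduction of that workload shape (not the statistics module itself): keep roughly 1 GiB of long-lived 8K objects alive across GC cycles, then time a fresh allocation.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Long-lived objects that survive every GC cycle, so sweeping finds almost
// nothing to free.
var longLived [][]byte

func main() {
	// ~1 GiB of live 8K objects.
	for i := 0; i < 128*1024; i++ {
		longLived = append(longLived, make([]byte, 8*1024))
	}
	runtime.GC() // finish a mark cycle so the next allocation may do sweep work

	start := time.Now()
	p := make([]byte, 8*1024) // on affected Go versions this may sweep many full spans
	fmt.Printf("allocated %d bytes in %v\n", len(p), time.Since(start))
}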
● Sweep, until a free slot is found ...
● … or there are no more spans to sweep
● With a lot of in-use objects surviving the mark phase, that can take a very long time
● Although Go GC's STW pauses are short enough
● Unexpected latency jitter caused by GC may still occur
Conclusion
Case study: Lock and NUMA aware
QPS bottleneck, but CPU usage ~60%
● TiDB server CPU usage is around 60%
● Networking and IO are both OK
● Increasing the benchmark client concurrency does not increase throughput
● The resources do not seem exhausted, so where is the bottleneck?
Description
● Something is keeping the CPU from being fully used
● So the bottleneck should be one of these:
○ Mutex
○ GC
○ NUMA
Hypothesis
curl -G "172.16.5.2:10080/debug/pprof/mutex" > mutex.profile
go tool pprof -http 0.0.0.0:4002 mutex.profile
Is it caused by a mutex? (mutex profiling must be enabled first; see the sketch below)
We found some unnecessary mutexes in these modules:
● GRPC
● log slow query
● log slow cop task
● query feedback
After optimization:
QPS 4.3K => 5.2K, CPU 58.2% => 64%
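Note that the mutex profile fetched above is empty unless sampling is turned on; a sketch of enabling it (not TiDB's exact setup):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // serves /debug/pprof/mutex among others
	"runtime"
)

func main() {
	// Sample roughly 1 out of every 5 mutex contention events;
	// the default of 0 records nothing, so /debug/pprof/mutex stays empty.
	runtime.SetMutexProfileFraction(5)
	log.Fatal(http.ListenAndServe("0.0.0.0:10080", nil))
}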
Bind CPUs to NUMA nodes
Is it caused by NUMA?
● QPS 5.2K => 7.59K
● P999 3.06s => 2s
● CPU usage 64% => 88.49%
● Minor page fault 229.8K => 8.68K
This single change gives the biggest improvement!
[Chart: QPS comparison — raw vs. optimized mutexes vs. CPU binding + optimized mutexes]
● Off CPU analysis with linuxki
○ From the scheduler activity report, 67% of the time is spent sleeping
Is it caused by GC?
runtime.(*mheap).freeSpan shows up at about 14%!
Is it caused by GC?
● There is a lock on the page allocator; see the Go issue "make the page allocator scale"
● A high allocation rate combined with a high degree of parallelism leads to significant contention on it
● Performance does not scale on machines with many CPUs
○ PS: the Go folks have already done a lot of work on this
Analysis
● NUMA (non-uniform memory access)
○ sockets
○ CPUs
○ local memory access
● OS allocation strategy
○ balanced
○ local
● Minor Page Faults might be related to
the page migration caused by NUMA
balancing
Background: a bit about NUMA
● Unfortunately, Go's GC is not NUMA aware
Analysis: NUMA aware
● These factors make the Go runtime less scalable on machines with a large number of cores
○ Lock contention in the allocation path
○ GC is not NUMA aware
○ ...
Conclusion
Recap
● CASE1: Batching client requests
■ latency in scheduler
● CASE2: THP(transparent huge pages)
■ memory control
● CASE3: GC sweep caused latency jitter
● CASE4: Lock and NUMA aware
■ scalability on multiple cores
Thank you!
Twitter: @dxhuang
pingcap.com
github.com/pingcap/tidb
