Fast dynamic analysis, Kostya Serebryany



  1. 1. Fast dynamic program analysis. Race detection. Konstantin Serebryany, May 20, 2011
  2. 2. Agenda
     ● Dynamic program analysis
     ● Race detection: theory
     ● ThreadSanitizer: race detector
     ● Making ThreadSanitizer faster
     ● Announcement of a new tool (premiere)
     ● War stories
  3. 3. Dynamic analysis
     ● Execute program and monitor interesting events
     ● Lightweight: no need to monitor memory accesses
       ○ Leak detection (monitor malloc/free)
       ○ Deadlock detection (monitor lock/unlock)
     ● Heavyweight: monitor memory accesses
       ○ Memory bugs
         ■ Out-of-bounds, use-after-free, uninitialized reads
       ○ Races
       ○ Pointer taintedness analysis
     ● Many more: profiling, coverage, ...
  4. 4. Data races are scary
     A data race occurs when two or more threads concurrently access a shared memory location and at least one of the accesses is a write.

         std::map<int,int> my_map;
         void Thread1() { my_map[123] = 1; }
         void Thread2() { my_map[345] = 2; }

     Our goal: find races in Google code
  5. 5. Happens-before (precedes): partial order on all events
     Segment: a sequence of READ/WRITE events of one thread.
     Signal(obj) -> Wait(obj) is a happens-before arc.
     Seg1 h.b. Seg4 -- segments belong to the same thread.
     Seg1 h.b. Seg5 -- due to a Signal/Wait pair with a matching object.
     Seg1 h.b. Seg7 -- happens-before is transitive.
     Seg3 and Seg6 -- no ordering constraint.
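A minimal sketch of how a happens-before check like the one on this slide could be computed, using vector clocks. The slides do not say how segments are represented, so the vector-clock representation and the names `VectorClock`, `HappensBefore` and `Join` are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: one logical-clock entry per thread.
using VectorClock = std::vector<unsigned>;

// True if a precedes b: every entry of a is <= the matching entry of b,
// and at least one entry is strictly smaller.
bool HappensBefore(const VectorClock& a, const VectorClock& b) {
  assert(a.size() == b.size());
  bool strictly_less = false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > b[i]) return false;
    if (a[i] < b[i]) strictly_less = true;
  }
  return strictly_less;
}

// Models the Signal(obj) -> Wait(obj) arc: after Wait, the waiter knows
// everything the signaller knew at the time of Signal.
VectorClock Join(const VectorClock& waiter, const VectorClock& signalled) {
  assert(waiter.size() == signalled.size());
  VectorClock out(waiter.size());
  for (std::size_t i = 0; i < waiter.size(); ++i) {
    out[i] = std::max(waiter[i], signalled[i]);
  }
  return out;
}
```

With two threads, a segment with clock `{1, 0}` precedes a waiter segment whose clock became `Join({0, 2}, {1, 0})`, while `{1, 0}` and `{0, 1}` are unordered, which is the "no ordering constraint" case.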
  6. 6. LockSet

         void Thread1() {
           mu1.Lock();
           mu2.Lock();
           *X = 1;
           mu2.Unlock();
           mu1.Unlock();
           ...
         }

         void Thread2() {
           mu1.Lock();
           mu3.Lock();
           *X = 2;
           mu3.Unlock();
           mu1.Unlock();
           ...
         }

     ● LockSet: a set of locks held during a memory access
       ○ Thread1: {mu1, mu2}
       ○ Thread2: {mu1, mu3}
     ● Common LockSet: intersection of LockSets
       ○ {mu1}
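The intersection step can be sketched as follows. The type alias `LockSet` and the helpers `CommonLockSet` and `NoCommonLock` are hypothetical names chosen for this example; locks are identified by name purely for readability.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical representation: a lock set is a set of lock names.
using LockSet = std::set<std::string>;

// Returns the locks that were held during both accesses.
LockSet CommonLockSet(const LockSet& a, const LockSet& b) {
  LockSet common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(common, common.begin()));
  return common;
}

// Under the LockSet rule, two accesses are suspicious when no single lock
// protected both of them.
bool NoCommonLock(const LockSet& a, const LockSet& b) {
  return CommonLockSet(a, b).empty();
}
```

For the slide's example, `CommonLockSet({"mu1", "mu2"}, {"mu1", "mu3"})` yields `{"mu1"}`, so the two writes to `*X` are consistently protected by `mu1`.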
  7. 7. Dynamic race detector: state machine
     ● Intercepts program events at run-time
       ○ Memory access: READ, WRITE
       ○ Synchronization: LOCK, UNLOCK, SIGNAL, WAIT
     ● Maintains global state
       ○ Locks, other synchronization events, threads
       ○ Memory allocation
     ● Maintains shadow state for each memory location (byte)
       ○ Records previous accesses
       ○ Reports race in appropriate state. E.g. current WRITE
         ■ ... does not happen-before previous READ
         ■ ... and previous WRITE have no common locks.
  8. 8. ThreadSanitizer
     ● Implemented in late 2008, open source.
     ● Initially based on the Valgrind binary translation framework.
     ● SLOW: 20x-50x slowdown.
       ○ Binary translation overhead is 1.5x-3x
       ○ Serializes threads (up to 8x on our machines)
       ○ Slow generalized state machine
     ● Slow is bad:
       ○ Many tests (and bugs) are timing dependent
       ○ Users are unhappy
       ○ Machines cost money
     ● Still very useful -- found thousands of races all over Google.
       ○ Server-side software (e.g. bigtable, GWS)
       ○ Google Chrome browser
  9. 9. ThreadSanitizer: algorithm
  10. 10. Speedup #1: fast path state machine
      ● Observation: 90%-99% of reads/writes are thread-private
      ● Simplification: special case for thread-private access
        ○ Very few global objects touched
        ○ No loops (~20 hand-written if/else statements)
        ○ 1.5x speedup
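A rough sketch of the fast-path idea, not ThreadSanitizer's actual code: a shadow cell remembers which thread owns a location, and an access by the same owner returns immediately without entering the general state machine. `ShadowCell`, `FastPathDetector` and the use of a hash map for shadow storage are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical per-location shadow state.
struct ShadowCell {
  int owner_tid = -1;   // -1: never accessed
  bool shared = false;  // set once a second thread touches the location
};

class FastPathDetector {
 public:
  // Returns true if the access was handled on the fast path.
  bool OnAccess(std::uintptr_t addr, int tid) {
    ShadowCell& cell = shadow_[addr];
    if (cell.owner_tid == -1) {  // first access: the location becomes private
      cell.owner_tid = tid;
      return true;
    }
    if (!cell.shared && cell.owner_tid == tid) {
      return true;  // still thread-private: nothing else to do
    }
    cell.shared = true;
    ++slow_path_calls_;  // stand-in for the full (slow) state machine
    return false;
  }

  int slow_path_calls() const { return slow_path_calls_; }

 private:
  std::unordered_map<std::uintptr_t, ShadowCell> shadow_;
  int slow_path_calls_ = 0;
};
```

In this toy version, repeated accesses by one thread never leave the fast path; only after a second thread touches the same address does every later access fall through to the slow path.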
  11. 11. Speedup #2: parallel fast path
      ● Fast path does not touch global state (almost)
        ○ Easy to parallelize (fast path w/o a lock, fallback to serialized slow path)
      ● Valgrind is not parallel, so PIN was used instead
        ○ Good alternative, also works on Windows.
        ○ But non-open-source is a huge disadvantage.
      ● Up to #CPUs times speedup (for Chrome: ~2x).
      ● Problem: how to find races in the detector itself (Valgrind can't run PIN)?
        ○ OUCH!
  12. 12. Speedup #3: faster instrumentation
      ● Valgrind/PIN add 1.5x-3x slowdown. Why pay that price?
      ● Use compiler instrumentation
        ○ + Less run-time overhead
        ○ - Need to recompile all libraries to catch races there
      ● Implemented LLVM and GCC plugins. Indeed 1.5x-3x faster.
      ● Bonus: now can detect races in the parallel race detector
        ○ TSan-Valgrind over TSan-LLVM
      ● Result: up to 50M memory events per second
  13. 13. Speedup #4: sampling
      ● Idea: ignore some accesses in hot regions
        ○ LiteRace, PLDI'09
      ● Execution counter for every code region (function or smaller).
      ● While the counter is small, don't ignore the region
      ● Larger counter -- ignore more frequently
      ● Moderate sampling rate: loses no races, 2x-4x speedup.

          if (num_to_skip-- <= 0) {
            HandleThisRegion();
            num_to_skip = (counter >> sampling_rate) + 1;
            counter += num_to_skip;
          }
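The snippet on this slide can be wrapped into a small self-contained sketch to see how the skip count evolves. `RegionSampler`, `OnRegionEntry` and the `handled` counter are hypothetical names; the body follows the slide's snippet literally, with `++handled` standing in for `HandleThisRegion()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-region sampling state, following the slide's snippet.
struct RegionSampler {
  std::uint32_t counter = 0;     // grows as the region keeps executing
  std::int32_t num_to_skip = 0;  // executions left to ignore
  std::uint32_t handled = 0;     // how many executions were analyzed

  // Returns true if this execution of the region is analyzed.
  bool OnRegionEntry(unsigned sampling_rate) {
    assert(sampling_rate < 32);  // shifting a 32-bit value by >= 32 is undefined
    if (num_to_skip-- <= 0) {
      ++handled;  // stands in for HandleThisRegion()
      num_to_skip = static_cast<std::int32_t>(counter >> sampling_rate) + 1;
      counter += static_cast<std::uint32_t>(num_to_skip);
      return true;
    }
    return false;
  }
};
```

Because `num_to_skip` is recomputed from `counter` each time the region is handled, the gap between analyzed executions widens as the region gets hotter.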
  14. 14. Results
      ● 1.5x-4x slowdown
      ● Can run Chrome interactively
        ○ Play Farmville or use GMail.
      ● Finds more bugs per day.
  15. 15. Premiere: AddressSanitizer (ASAN)
      ● Many memory error detectors exist:
        ○ Slow: Valgrind, DrMemory, Purify, Boundschecker, Insure++, Intel Inspector, mudflap, ...
        ○ Incomplete: libgmalloc, Electric Fence, Page Heap, ...
      ● AddressSanitizer (ASAN): fast address sanity checker
        ○ Use-after-free
        ○ Out-of-bounds (aka buffer overflow) for heap and stack
        ○ Double-free, etc.
        ○ Linux, Mac, ChromeOS
        ○ 2x-2.5x slowdown (faster than Debug build!)
        ○ LLVM instrumentation module + specialized malloc
  16. 16. Generic addressability checking
      ● malloc()/free() replacement library (most tools):
        ○ poison redzones around malloc-ed memory
        ○ poison memory on free()
        ○ delay reuse of free-ed memory
      ● Stack poisoning (few tools)
      ● Instrument all loads and stores
        ○ if (IsPoisoned(mem)) BANG();
      ● The tricky part: how to implement IsPoisoned and BANG
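A toy sketch of the redzone and poison-on-free scheme listed on this slide, using offsets into a fixed arena instead of real pointers. `ToyHeap`, `kRedzone` and the byte-per-byte poison map are assumptions for illustration; no real tool is implemented this way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical arena where every byte starts out poisoned.
class ToyHeap {
 public:
  static constexpr std::size_t kRedzone = 16;

  explicit ToyHeap(std::size_t size) : poisoned_(size, true) {}

  // Unpoisons n bytes and leaves poisoned redzones on both sides.
  // Returns the offset of the first usable byte.
  std::size_t Malloc(std::size_t n) {
    std::size_t start = next_ + kRedzone;
    assert(start + n + kRedzone <= poisoned_.size());
    for (std::size_t i = start; i < start + n; ++i) poisoned_[i] = false;
    next_ = start + n;
    return start;
  }

  // Poisons the block again. The toy heap never reuses freed bytes, which is
  // the extreme form of "delay reuse of free-ed memory".
  void Free(std::size_t start, std::size_t n) {
    for (std::size_t i = start; i < start + n; ++i) poisoned_[i] = true;
  }

  // The check an instrumented load or store would perform.
  bool IsPoisoned(std::size_t addr) const { return poisoned_.at(addr); }

 private:
  std::vector<bool> poisoned_;
  std::size_t next_ = 0;
};
```

After `p = heap.Malloc(8)`, offsets `p` through `p + 7` are addressable, `p - 1` and `p + 8` hit a redzone, and after `heap.Free(p, 8)` an access to `p` is flagged as use-after-free.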
  17. 17. AddressSanitizer algorithm
      Mem => Shadow is an 8 to 1 mapping.
      Instrumenting an 8-byte access to Mem:

          Shadow = (Mem>>3)+0x20000000;
          if (*Shadow) {  // 1 byte load
            Bad = Shadow * 2;
            *Bad = 0;  // SEGV!
          }

      Address ranges shown on the slide:

          [0x80000000, 0xffffffff]
          [0x60000000, 0x7fffffff]
          [0x40000000, 0x47ffffff]
          [0x30000000, 0x3fffffff]
          [0x20000000, 0x23ffffff]
          [0x00000000, 0x1fffffff]
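The address arithmetic on this slide can be checked with a short sketch. `MemToShadow`, `ShadowToBad` and `kShadowOffset` are hypothetical names; the formulas and the constant 0x20000000 are taken from the slide, and addresses are modelled as 32-bit integers rather than real pointers.

```cpp
#include <cassert>
#include <cstdint>

// Offset from the slide's formula: Shadow = (Mem >> 3) + 0x20000000.
constexpr std::uint32_t kShadowOffset = 0x20000000u;

// Maps an application address to the address of its shadow byte (8 to 1).
constexpr std::uint32_t MemToShadow(std::uint32_t mem) {
  return (mem >> 3) + kShadowOffset;
}

// Address that the instrumentation writes to when the shadow byte is non-zero.
constexpr std::uint32_t ShadowToBad(std::uint32_t shadow) {
  return shadow * 2;
}

// The lowest and highest listed ranges map onto two of the other ranges.
static_assert(MemToShadow(0x00000000u) == 0x20000000u, "low memory start");
static_assert(MemToShadow(0x1fffffffu) == 0x23ffffffu, "low memory end");
static_assert(MemToShadow(0x80000000u) == 0x30000000u, "high memory start");
static_assert(MemToShadow(0xffffffffu) == 0x3fffffffu, "high memory end");
```

Under these formulas, [0x00000000, 0x1fffffff] maps to shadow [0x20000000, 0x23ffffff] and [0x80000000, 0xffffffff] maps to shadow [0x30000000, 0x3fffffff]. Doubling a shadow address lands inside [0x40000000, 0x47ffffff] or [0x60000000, 0x7fffffff], which suggests those two ranges are where the deliberate faulting store goes.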
  18. 18. AddressSanitizer demo