During the development of an internal deployment tool we hit the type of problem we all dread – a deadlock triggered in a core Ruby module. This talk covers how this specific bug was identified, some general advice on how to find these kinds of issues, and what Ruby could learn from other languages in this area.
2. Structure of the Talk
• Signals & Stacktraces
• Reproduce / Examine / Experiment
• Root Cause Analysis
• Results From MRI, JRuby, RBX
• Lessons from Other Languages
22. Examine
def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

@mon_mutex = Mutex.new
@mon_owner = nil
@mon_count = 0
Grrrrr...
23. Rant
• I’ve debugged locks involving reentrant
mutexes more times than I can remember
• If you ever feel like using a reentrant mutex,
please I beg you, don’t do it
• There’s almost always a way to structure your
code so that you can use a regular mutex
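One common way to do that restructuring, sketched here with a hypothetical Inventory class (the class and method names are illustrative, not from the talk): public methods take the lock exactly once, and private helpers assume the lock is already held, so nothing ever needs to re-acquire it.

```ruby
# Hypothetical example class; the names are illustrative, not from the talk.
class Inventory
  def initialize
    @lock = Mutex.new          # plain, non-reentrant mutex
    @stock = Hash.new(0)
  end

  # Public API: each method takes the lock exactly once.
  def add(item, qty)
    @lock.synchronize { add_unlocked(item, qty) }
  end

  def move(from, to, qty)
    @lock.synchronize do
      # Calls the helper rather than the public add method, so the lock
      # is never requested again while it is already held.
      add_unlocked(from, -qty)
      add_unlocked(to, qty)
    end
  end

  def count(item)
    @lock.synchronize { @stock[item] }
  end

  private

  # Caller must already hold @lock.
  def add_unlocked(item, qty)
    @stock[item] += qty
  end
end
```

If move called the public add instead, Ruby's Mutex would refuse the second lock attempt from the same thread, which is exactly the situation a reentrant mutex tends to paper over.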
24. Examine
def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end
Anything wrong with this?
26. Examine
• Take a look at the first line in mon_enter:
  if @mon_owner != Thread.current
• @mon_owner is modified by multiple threads
• It is read by other threads without being locked
• Read access needs a mutex too
27. Aside: Double Checked Locking
• Many people have gotten this wrong
• Doug Schmidt & Co, ACE C++
• Pattern-Oriented Software Architecture
(Volume 2, April 2001)
• Popularised a pattern that was completely
broken: Double Checked Locking
28. Aside: Takeaway
• A variable shared between multiple threads...
• ...Modified by one or more threads
• You need to use a mutex around the
modification (of course)
• But you also need a mutex around any
READ access to that variable
GIL?
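A minimal sketch of that takeaway, using a hypothetical LazyValue class (not from the talk). Every access to the shared state, including the "has it been built yet?" read, happens under the mutex. There is no unlocked fast path of the kind double checked locking relies on.

```ruby
# Hypothetical holder for a lazily built value shared between threads.
class LazyValue
  def initialize(&builder)
    @lock = Mutex.new
    @builder = builder
    @built = false
    @value = nil
  end

  # Reads and writes of @built and @value all happen while holding @lock.
  def get
    @lock.synchronize do
      unless @built
        @value = @builder.call
        @built = true
      end
      @value
    end
  end
end
```

Usage: lazy = LazyValue.new { expensive_setup } followed by lazy.get from any thread, where expensive_setup stands in for whatever builds the value. The cost is one uncontended lock per read, which is usually cheap compared with chasing a visibility bug.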
29. Aside: Takeaway
This is because of…
• Instruction pipelining
• Multiple levels of chip caches
• Out of order memory references
• The memory model of the platform
• The memory model of the language
30. Examine
def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

def mon_exit
  mon_check_owner
  @mon_count -= 1
  if @mon_count == 0
    @mon_owner = nil
    @mon_mutex.unlock
  end
end
31. Examine
So that's two concerning things so far:
1. Logger's rescue of Exception
2. Read access to @mon_owner outside of any
mutex
33. Experiment 1
The Change:
• Add a puts for every access to @mon_owner and
@mon_count (& the Thread ID)
The Result:
• Deadlock
• I saw @mon_count changing from 0 to 2
34. Experiment 2
The Change:
• Keep track of @mon_count and @mon_owner in
a list in memory (& the Thread ID)
• Puts the list when we dump the stacktraces
The Result:
• Deadlock
• @mon_count changing from 0 to 2 (same)
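A rough sketch of what this kind of instrumentation could look like. The TRACE constant and the trace and dump_state helpers are hypothetical names, not the code used in the talk. Events are recorded in memory (cheap, so less likely to disturb timing than a puts) and only printed together with the stack of every live thread.

```ruby
# Hypothetical in-memory trace; Queue is used because it is thread-safe.
TRACE = Queue.new

# Record one event; intended to be called from inside mon_enter / mon_exit.
def trace(event, mon_owner, mon_count)
  TRACE << [Thread.current.object_id, event, mon_owner.object_id, mon_count]
end

# Returns the backtrace of every live thread followed by the recorded trace.
# Assumes this is the only place that pops from TRACE.
def dump_state
  lines = []
  Thread.list.each do |thread|
    lines << "Thread #{thread.object_id} (#{thread.status})"
    (thread.backtrace || []).each { |frame| lines << "  #{frame}" }
  end
  until TRACE.empty?
    tid, event, owner, count = TRACE.pop
    lines << "trace thread=#{tid} #{event} owner=#{owner} count=#{count}"
  end
  lines.join("\n")
end
```

Calling puts dump_state at the point where the stack traces are dumped prints both views side by side, which makes a jump in @mon_count easy to line up with the thread that caused it.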
35. Experiment 3
The Change:
• @mon_owner and @mon_count don’t really need
to be shared among threads
• Use thread local variables instead
The Result:
• Deadlock
• @mon_count jumps from 0 to 2 occasionally
36. Experiment 4
The Change:
• When a thread acquires the monitor mutex
@mon_count should always be zero
• So check to see if it’s ever non-zero
37. Experiment 4
def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
    if @mon_count != 0
      puts '=========XXXXXXXXXX======='
    end
  end
  @mon_count += 1
end
38. Experiment 4
The Result:
• Test again → No Deadlock → No log line
• OK that's really odd…
• But you can't rely on a negative, so then I
removed those lines and ran again
• Now it locks
39. Experiment 4
The Result:
• Add back the lines → Doesn't lock
• Remove the lines → Deadlocks quickly
• Hmm, ok that's definitely odd, feels like a
memory visibility issue
40. Experiment 5
The Change:
• Download and build a debug version of MRI
• Looking in thread.c I found:
rb_threadptr_unlock_all_locking_mutexes()
with the following warning commented out:
/* rb_warn("mutex #<%p> remains to be locked by terminated thread", mutexes); */
42. Experiment 6
The Change:
• Examining mon_enter and mon_exit we can
see that when the lock is taken @mon_count
should always be zero
• But we saw @mon_count jumping from 0 to 2 so
let’s try putting in @mon_count = 0 explicitly
43. Experiment 6
def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
    @mon_count = 0
  end
  @mon_count += 1
end
44. Experiment 6
The Result:
• Doesn't lock, left it running for hours
• Take out the @mon_count = 0 and it locks
• But remember checking if @mon_count != 0
had the same effect
45. Experiment 6
• So it seems that adding
@mon_count = 0
"fixes" the problem
• I still want to understand the cause
• I’d like a reproducible test case that
doesn’t rely on our service
46. Experiment 7
• With new threads coming and going, some
exiting normally, some timing out, all emitting
log messages
• What if we log heavily within a timeout
block and time it out in a bunch of
threads?
47. Experiment 7
def run
  count = 1
  begin
    Timeout.timeout(1) do
      loop do
        @logger.error("#{Thread.current}: Loop #{count}")
        count += 1
      end
    end
  rescue Exception
    @logger.error("#{Thread.current}: Exception #{count}")
  end
end
48. Experiment 7
• So with this code I get:
... `join': No live threads left. Deadlock? (fatal)
• Happens every time after a few seconds
49. Experiment 7
• Since it says all threads are dead, what
happens if there is another thread just sitting
there doing nothing?
• Add
Thread.new { loop { sleep 1 } }
50. Experiment 7
• Run the code again → Deadlock
• All threads stuck in the same location as before:
.../monitor.rb:185:in `lock'
.../monitor.rb:185:in `mon_enter'
.../monitor.rb:210:in `mon_synchronize'
.../logger.rb:559:in `write'
51. Experiment 7
• So now I have a simple test case that
reproduces the issue every time
• I can also confirm that adding @mon_count = 0
into mon_enter "fixes" the problem
52. Examine Again
• At some point during all of this I showed this to
a colleague who suggested I look for recent
changes in this code within the Ruby repo
• We checked the Ruby git repo...
53. Examine Again
commit 7be5169804ee0cfe1991903fa10c31f8bd6525bd
Author: shugo <shugo@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Mon May 18 04:56:22 2015 +0000
* lib/monitor.rb (mon_try_enter, mon_enter): should reset
@mon_count just in case the previous owner thread dies
without mon_exit.
[fix GH-874] Patch by @chrisberkhout
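The scenario in that commit message can be replayed deterministically. What follows is a sketch under stated assumptions: SketchMonitor is a simplified stand-in for the old pure-Ruby monitor (not the real class), the reset_count flag is a hypothetical switch for the one-line fix, and the sketch relies on MRI releasing any mutex still held by a thread that has terminated.

```ruby
# Simplified stand-in for the old pure-Ruby monitor; not the real class.
class SketchMonitor
  attr_reader :mon_count

  def initialize(reset_count:)
    @reset_count = reset_count   # hypothetical switch: apply the fix or not
    @mon_mutex = Mutex.new
    @mon_owner = nil
    @mon_count = 0
  end

  def mon_enter
    if @mon_owner != Thread.current
      @mon_mutex.lock
      @mon_owner = Thread.current
      @mon_count = 0 if @reset_count   # the fix from the commit
    end
    @mon_count += 1
  end

  def mon_exit
    @mon_count -= 1
    if @mon_count == 0
      @mon_owner = nil
      @mon_mutex.unlock
    end
  end

  def locked?
    @mon_mutex.locked?
  end
end

# A thread enters the monitor and dies without calling mon_exit.
# Assumption: MRI unlocks the mutex when that thread terminates,
# but the stale @mon_count is left behind.
def enter_after_dead_owner(reset_count:)
  monitor = SketchMonitor.new(reset_count: reset_count)
  Thread.new { monitor.mon_enter }.join
  monitor.mon_enter
  count_after_enter = monitor.mon_count
  monitor.mon_exit
  { count_after_enter: count_after_enter, still_locked: monitor.locked? }
end
```

Without the reset, the next owner sees @mon_count go to 2, its mon_exit only brings it back to 1, and the mutex is never unlocked, so every other thread blocks in lock. With the reset, the count returns to 0 and the mutex is released.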
55. Root Cause Analysis
• With a little more thought I realised what
the root cause of this problem is…
• It’s the Timeout module and how it corrupts
state in the monitor object
56. Root Cause Analysis
class Monitor
  def synchronize
    mon_enter
    begin
      yield
    ensure
      mon_exit
    end
  end
end

Timeout.timeout(seconds) do
  logger.write
end

class Logger
  def write
    @mon.synchronize do
      write_log  # placeholder for the actual log write
    end
  end
end
57. Root Cause Analysis
Thread 1 (T1)
• Timeout.timeout(seconds)
• Start a new thread T2
• logger.write
• mon.synchronize
– write-the-log
• Kill T2
Thread 2 (T2)
• Keeps a reference to T1
• sleep(seconds)
• Raise a Timeout exception against T1
65. Results From Different Ruby VMs
JRuby 1.6.8, JRuby 1.7.11, JRuby 1.7.19, JRuby 9.0.0.0.pre1
Deadlock: Yes
Starvation: Yes
No Deadlock or Starvation, though only because the Timeout exception is not raised
66. Results From Different Ruby VMs
             RBX 2.4.1     RBX 2.5.2
Deadlock     VM Crashes    Yes, though mostly thread starvation
Starvation   VM Crashes    Yes
67. My Assertion
It is fundamentally unsafe to interrupt
a running thread in the general case
Side Effects
69. Thread Cancellation in C
• The C pthread API offers additional features
• Has the concept of thread cancellation:
• Enable / Disable thread cancellation requests
• User defined, per thread cleanup handlers
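For comparison, Ruby's closest analogue to enabling and disabling cancellation requests is probably Thread.handle_interrupt, with ensure blocks playing the role of cleanup handlers. This is an aside that is not part of the talk, and the deferred_interrupt_demo helper below is a hypothetical name. It defers an asynchronous Thread#raise until a critical section has finished.

```ruby
# Hypothetical demo: an asynchronous exception raised against the worker
# is held back while the :never mask is in effect.
def deferred_interrupt_demo
  events = []
  ready = Queue.new
  go = Queue.new

  worker = Thread.new do
    begin
      Thread.handle_interrupt(RuntimeError => :never) do
        ready << true
        go.pop                           # the interrupt stays pending here
        events << :critical_section_done
      end
      # The pending exception is delivered once the masked block is left.
    rescue RuntimeError
      events << :interrupted_after
    end
  end

  ready.pop                              # worker is inside the masked block
  worker.raise(RuntimeError, "cancel")   # asynchronous interrupt
  go << true                             # let the critical section finish
  worker.join
  events
end
```

The mask only helps code that opts in. A Timeout firing inside a library that does not mask interrupts can still land between any two lines.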
70. Thread Cancellation in Java
Why is Thread.stop deprecated?
• Because it is inherently unsafe
• Stopping a thread causes it to unlock all the
monitors that it has locked
• If any of the objects previously protected by these
monitors were in an inconsistent state, other
threads may now view these objects in an
inconsistent state
http://docs.oracle.com/javase/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html
71. Java Thread Cancellation
“As of JDK8, Thread.stop is really gone. It is the
first deprecated method to have actually been
de-implemented. It now just throws
UnsupportedOperationException”
Doug Lea, Java Concurrency Guru
http://cs.oswego.edu/pipermail/concurrency-interest/2013-December/012028.html
72. What Does The Community Say?
• Jan 2008: Ruby rant: Timeout::Error (http://goo.gl/PLxR76)
• February 2008: Ruby's Thread#raise, Thread#kill, timeout.rb, and net/protocol.rb libraries are broken (http://goo.gl/DI8GMX)
• March 2013: Ruby timeouts are dangerous (https://goo.gl/3EoTM6)
• May 2015: Ruby’s Most Dangerous API (http://goo.gl/2RkFbn)
• Nov 2015: Why Ruby’s Timeout is dangerous (and Thread.raise is terrifying) (http://goo.gl/xLvuWG)
• Reliable Ruby timeouts for M.R.I. 1.8: https://github.com/ph7/system-timer
• Fixing Ruby's standard library Timeout: https://github.com/jjb/sane_timeout
• A safer alternative to Ruby's Timeout that uses unix processes instead of threads: https://github.com/david-mccullars/safe_timeout
• Better timeout management for ruby: https://github.com/ryanking/deadline
73. Threading Quick Links
John Ousterhout:
http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf
Ed Lee:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
• Best practices, smart people, code locked up
after running successfully with minimal changes
for four years
74. Takeaways
• Try to avoid writing multi-threaded code
• Try to avoid reentrant mutexes
• Always use a mutex for read access to
shared state
• Don’t use the Timeout module. It’s a
broken concept
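One alternative to interrupting a thread from outside is to have the work check a deadline itself. The repeat_until_deadline helper below is a hypothetical sketch, not something from the talk. The block is never interrupted part-way through, so a lock taken inside it is always released normally.

```ruby
# Hypothetical helper: calls the block repeatedly until the deadline passes.
# The deadline is only checked between calls, never in the middle of one.
def repeat_until_deadline(seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  iterations = 0
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    yield iterations
    iterations += 1
  end
  iterations
end
```

Usage: repeat_until_deadline(1) { |i| logger.error("Loop #{i}") } covers the same ground as the Timeout.timeout(1) loop from Experiment 7, with the trade-off that a single slow call can overrun the deadline. For blocking I/O, the timeout options offered by the I/O library itself are usually the better tool.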