Louis Dunne
Principal Software Engineer @ Workday
Finding Concurrency Problems
in Core Ruby Libraries
Structure of the Talk
• Signals & Stacktraces
• Reproduce / Examine / Experiment
• Root Cause Analysis
• Results From MRI, JRuby, RBX
• Lessons from Other Languages
Signals & Stacktraces
Signal Handlers in 2.0
• Can't use a mutex
• Hard to share state safely
reader, writer = IO.pipe
# writer.puts won’t block
%w(INT USR2).each do |sig|
Signal.trap(sig) { writer.puts(sig) }
end
Thread.new { signal_thread(reader) }
Signals & Stacktraces
# signal_thread...
sig = reader.gets.chomp # This will block
...
Thread.list.each do |thd|
puts(thd.backtrace.join("\n"))
end
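The two fragments above can be combined into one runnable sketch. The trap handler only writes the signal name to a pipe, and an ordinary thread does the real work, so no mutex is needed in trap context. The helper names `dump_backtraces` and `start_signal_thread` are made up for this illustration and are not from the talk.

```ruby
# Hypothetical helper: print one backtrace per live thread.
def dump_backtraces(out = $stdout)
  Thread.list.each do |thd|
    out.puts("Thread #{thd.object_id} (#{thd.status})")
    # backtrace can be nil for a thread that has already finished
    out.puts((thd.backtrace || ['<no backtrace>']).join("\n"))
  end
end

# Hypothetical helper: install handlers that only write to a pipe,
# then start a normal thread that reacts to what the handlers wrote.
def start_signal_thread(signals, out = $stdout)
  reader, writer = IO.pipe
  signals.each do |sig|
    Signal.trap(sig) { writer.puts(sig) }
  end
  Thread.new do
    # reader.gets blocks until a handler writes a line
    while (line = reader.gets)
      out.puts("Received SIG#{line.chomp}")
      dump_backtraces(out)
    end
  end
end
```

Calling `start_signal_thread(%w(INT USR2))` at startup and then sending `kill -USR2 <pid>` should print the stack of every thread without stopping the process.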
Signals & Stacktraces
Reproduce, Examine, Experiment
Reproduce It
• A lot of effort but essential
• The easier you can reproduce it
• The easier you can debug it
Examine The Code
• Multi-threaded principles
• Anything obvious?
• You still need to experiment
and prove your case
Experiment
• Start running experiments
• See if your expectations match reality
• Keep a written log
Reproduce
• Start simple...
• Start 100 clients doing a list operation
while true; do
date
echo "Starting 100 requests..."
for i in {1..100}; do
<rest-client-list-operation> &
done
wait
done
Reproduce
Reproduce
• On my laptop → No lockup
• On a real server → No lockup
• Need to try both
• More concurrency
• A dependency verification thread
• Run this every second
• Test again for 30 minutes → No lockup
Reproduce
• We deploy to an OpenStack cluster
• What if we do nothing and return early
• Run the test again for 30 minutes
→ No lockup
Reproduce
Reproduce
Timeout.timeout(job.timeout_seconds) do
run_job(job)
end
• Set job.timeout_seconds to 1
→ Deadlock!
Reproduce
.../monitor.rb:185:in `lock'
.../monitor.rb:185:in `mon_enter'
.../monitor.rb:210:in `mon_synchronize'
.../logger.rb:559:in `write'
Examine
def write(message)
begin
@mutex.synchronize do
# write-the-log-line
end
rescue Exception
# log-a-warning
end
end
Anything wrong
with this?
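A small sketch of why `rescue Exception` is worth a second look. It catches far more than `rescue StandardError`, including exceptions that are meant to stop or interrupt a thread. The two method names are hypothetical, and `Interrupt` is used only as a convenient example of a non-StandardError exception.

```ruby
# Hypothetical wrapper that behaves like the write method above.
def swallow_everything
  yield
  :completed
rescue Exception => e
  "swallowed #{e.class}"
end

# Hypothetical wrapper that only handles ordinary errors.
def swallow_standard_errors
  yield
  :completed
rescue StandardError => e
  "swallowed #{e.class}"
end

# Interrupt is not a StandardError, so only the first wrapper hides it.
hidden = swallow_everything { raise Interrupt, 'simulated' }

escaped = begin
  swallow_standard_errors { raise Interrupt, 'simulated' }
rescue Interrupt
  :propagated
end
```

Here `hidden` is `"swallowed Interrupt"` and `escaped` is `:propagated`. An exception raised into the thread from outside while it is inside the first wrapper would be hidden in the same way.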
Examine
• mon_synchronize
• mon_enter
• mon_exit
• mon_check_owner
Examine
def mon_synchronize
mon_enter
begin
yield
ensure
mon_exit
end
end
Looks OK
Examine
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
end
@mon_count += 1
end
@mon_mutex = Mutex.new
@mon_owner = nil
@mon_count = 0
Grrrrr...
Rant
• I’ve debugged locks involving reentrant
mutexes more times than I can remember
• If you ever feel like using a reentrant mutex,
please I beg you, don’t do it
• There’s almost always a way to structure your
code so that you can use a regular mutex
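For readers unfamiliar with the distinction, this sketch shows the difference in behaviour: a `Monitor` lets the owning thread lock it again, while a plain `Mutex` raises `ThreadError` on a recursive lock. The helper name `relock` is hypothetical.

```ruby
require 'monitor'

# Hypothetical helper: try to take the same lock twice from one thread.
def relock(lock)
  lock.synchronize do
    lock.synchronize { :reentered }
  end
rescue ThreadError => e
  e.class
end

monitor_result = relock(Monitor.new) # reentrant: the inner block runs
mutex_result   = relock(Mutex.new)   # not reentrant: recursive locking raises
```

With a regular mutex the recursive lock fails loudly at the point of the mistake, which is part of the argument for structuring code so that reentrancy is not needed.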
Examine
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
end
@mon_count += 1
end
Anything wrong
with this?
Examine
def mon_exit
mon_check_owner
@mon_count -= 1
if @mon_count == 0
@mon_owner = nil
@mon_mutex.unlock
end
end
Looks OK
if @mon_owner != Thread.current
raise...
• Take a look at the first line in mon_enter:
if @mon_owner != Thread.current
• Modified by multiple threads
• Read by other threads without being locked
• Read access needs a mutex too
Examine
Aside: Double Checked Locking
• Many people have gotten this wrong
• Doug Schmidt & Co, ACE C++
• Pattern-Oriented Software Architecture
(Volume 2, April 2001)
• Popularised a pattern that was completely
broken: Double Checked Locking
• A variable shared between multiple threads...
• ...Modified by one or more threads
• You need to use a mutex around the
modification (of course)
• But you also need a mutex around any
READ access to that variable
Aside: Takeaway
GIL?
Aside: Takeaway
This is because of…
• Instruction pipelining
• Multiple levels of chip caches
• Out of order memory references
• The memory model of the platform
• The memory model of the language
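A minimal sketch of the takeaway, assuming a small hypothetical class `OwnerRecord` that is not part of monitor.rb. Every read and every write of the shared variable goes through the same mutex, so no thread ever inspects `@owner` outside the lock.

```ruby
# Hypothetical example class: shared state guarded for reads AND writes.
class OwnerRecord
  def initialize
    @lock = Mutex.new
    @owner = nil
  end

  # Read access takes the mutex too.
  def owner
    @lock.synchronize { @owner }
  end

  # Check-and-set happens entirely inside the lock.
  def claim(thread)
    @lock.synchronize do
      return false unless @owner.nil?
      @owner = thread
      true
    end
  end

  def release(thread)
    @lock.synchronize do
      return false unless @owner.equal?(thread)
      @owner = nil
      true
    end
  end
end
```

The check in `claim` and the assignment that follows it form one critical section, which is the property the unlocked `if @mon_owner != Thread.current` test does not have.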
Examine
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
end
@mon_count += 1
end
def mon_exit
mon_check_owner
@mon_count -= 1
if @mon_count == 0
@mon_owner = nil
@mon_mutex.unlock
end
end
Examine
So that's two concerning things so far:
1. Logger's rescue of Exception
2. Read access to @mon_owner outside of any
mutex
Experiment
Experiment 1
The Change:
• Log every access to @mon_owner and @mon_count
with puts (& the Thread ID)
The Result:
• Deadlock
• I saw @mon_count changing from 0 to 2
The Change:
• Keep track of @mon_count and @mon_owner in
a list in memory (& the Thread ID)
• Print the list with puts when we dump the stacktraces
Experiment 2
The Result:
• Deadlock
• @mon_count changing from 0 to 2 (same)
The Change:
• @mon_owner and @mon_count don’t really need
to be shared among threads
• Use thread local variables instead
Experiment 3
The Result:
• Deadlock
• @mon_count jumps from 0 to 2 occasionally
The Change:
• When a thread acquires the monitor mutex
@mon_count should always be zero
• So check to see if it’s ever non-zero
Experiment 4
Experiment 4
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
if @mon_count != 0
puts '=========XXXXXXXXXX======='
end
end
@mon_count += 1
end
Experiment 4
The Result:
• Test again → No Deadlock → No log line
• OK that's really odd…
• But you can't rely on a negative, so then I
removed those lines and ran again
• Now it locks
Experiment 4
The Result:
• Add back the lines → Doesn't lock
• Remove the lines → Deadlocks quickly
• Hmm, ok that's definitely odd, feels like a
memory visibility issue
The Change:
• Download and build a debug version of MRI
• Looking in thread.c I found:
rb_threadptr_unlock_all_locking_mutexes()
with the following warning commented out:
Experiment 5
/* rb_warn("mutex #<%p> remains to be locked
by terminated thread", mutexes); */
The Result:
• Deadlocks
• Saw threads exiting with that warning about
locked mutexes
Experiment 5
Experiment 6
The Change:
• Examining mon_enter and mon_exit we can
see that when the lock is taken @mon_count
should always be zero
• But we saw @mon_count jumping from 0 to 2 so
let’s try putting in @mon_count = 0 explicitly
Experiment 6
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
@mon_count = 0
end
@mon_count += 1
end
Experiment 6
The Result:
• Doesn't lock, left it running for hours
• Take out the @mon_count = 0 and it locks
• But remember checking if @mon_count != 0
had the same effect
Experiment 6
• So it seems that adding
@mon_count = 0
"fixes" the problem
• I still want to understand the cause
• I’d like a reproducible test case that
doesn’t rely on our service
Experiment 7
• With new threads coming and going, some
exiting normally, some timing out, all emitting
log messages
• What about if we try to log heavily within a
timeout block and time it out in a bunch of
threads
def run
count = 1
begin
Timeout.timeout(1) do
loop do
@logger.error("#{Thread.current}: Loop #{count}")
count += 1
end
end
rescue Exception
@logger.error("#{Thread.current}: Exception #{count}")
end
end
Experiment 7
• So with this code I get:
... `join': No live threads left.
Deadlock? (fatal)
• Happens every time after a few seconds
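The `run` method above can be wrapped in a self-contained harness along these lines. This is a sketch under assumptions: the names `log_until_timeout` and `run_stress`, the thread count, and the use of `File::NULL` as the log target are all choices made for this illustration, not details from the talk. The idle thread keeps the VM from reporting "No live threads left" while the workers are stuck.

```ruby
require 'logger'
require 'timeout'

# Same shape as the run method above: log heavily until the timeout fires.
def log_until_timeout(logger, timeout_seconds)
  count = 1
  begin
    Timeout.timeout(timeout_seconds) do
      loop do
        logger.error("#{Thread.current}: Loop #{count}")
        count += 1
      end
    end
  rescue Exception # deliberately broad, mirroring the original test case
    logger.error("#{Thread.current}: Exception #{count}")
  end
  count
end

# Hypothetical driver: returns how many worker threads finished in time.
def run_stress(thread_count: 10, timeout_seconds: 1, join_limit: 10)
  logger = Logger.new(File::NULL)
  idle = Thread.new { loop { sleep 1 } }
  workers = Array.new(thread_count) do
    Thread.new { log_until_timeout(logger, timeout_seconds) }
  end
  # join(limit) returns nil for a thread that is still stuck
  finished = workers.count { |t| t.join(join_limit) }
  idle.kill
  finished
end
```

On an affected interpreter, `run_stress` would be expected to return fewer than `thread_count` because some workers never leave `mon_enter`. On a version with the fix it should return `thread_count`.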
Experiment 7
Experiment 7
• Since it says all threads are dead, what
happens if there is another thread just sitting
there doing nothing?
• Add
Thread.new { loop { sleep 1 } }
Experiment 7
• Run the code again → Deadlock
• All threads stuck in the same location as before:
.../monitor.rb:185:in `lock'
.../monitor.rb:185:in `mon_enter'
.../monitor.rb:210:in `mon_synchronize'
.../logger.rb:559:in `write'
• So now I have a simple test case that
reproduces the issue every time
• I can also confirm that adding @mon_count = 0
into mon_enter "fixes" the problem
Experiment 7
Examine Again
• At some point during all of this I showed this to
a colleague who suggested I look for recent
changes in this code within the Ruby repo
• We checked the Ruby git repo...
Examine Again
commit 7be5169804ee0cfe1991903fa10c31f8bd6525bd
Author: shugo <shugo@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Mon May 18 04:56:22 2015 +0000
* lib/monitor.rb (mon_try_enter, mon_enter): should reset
@mon_count just in case the previous owner thread dies
without mon_exit.
[fix GH-874] Patch by @chrisberkhout
Root Cause Analysis
Root Cause Analysis
• With a little more thought I realised what
the root cause of this problem is…
• It’s the Timeout module and how it corrupts
state in the monitor object
Root Cause Analysis
class Monitor
def synchronize
mon_enter
begin
yield
ensure
mon_exit
end
end
end
Timeout.timeout(seconds) do
logger.write
end
class Logger
def write
@mon.synchronize do
# write-log
end
end
end
Root Cause Analysis
Thread 1 (T1)
• Timeout.timeout(seconds)
• Start a new thread T2
• logger.write
• mon.synchronize
– write-the-log
• Kill T2
Thread 2 (T2)
• Keeps a reference to T1
• sleep(seconds)
• Raise a Timeout
exception against T1
class Monitor
def synchronize
mon_enter
begin
yield
ensure
mon_exit
end
end
end
Root Cause Analysis
• What about right here?
• mon_enter is invoked
• mon_exit is not
def mon_enter
if @mon_owner != Thread.current
@mon_mutex.lock
@mon_owner = Thread.current
end
@mon_count += 1
end
Root Cause Analysis
Root Cause Analysis
def mon_exit
mon_check_owner
@mon_count -= 1
if @mon_count == 0
@mon_owner = nil
@mon_mutex.unlock
end
end
Finally!
It all makes sense
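The sequence can be replayed deterministically with a toy copy of the monitor logic. `ToyMonitor`, its `reset_on_acquire` flag and `replay` are hypothetical names for this sketch. It relies on MRI releasing any mutex held by a thread when that thread terminates, which is what lets the second thread acquire the lock while the count is still 1.

```ruby
# Hypothetical re-creation of the old mon_enter / mon_exit logic.
class ToyMonitor
  attr_reader :mon_count

  def initialize(reset_on_acquire: false)
    @mon_mutex = Mutex.new
    @mon_owner = nil
    @mon_count = 0
    @reset_on_acquire = reset_on_acquire
  end

  def mon_enter
    if @mon_owner != Thread.current
      @mon_mutex.lock
      @mon_owner = Thread.current
      @mon_count = 0 if @reset_on_acquire # the one-line fix
    end
    @mon_count += 1
  end

  def mon_exit
    @mon_count -= 1
    if @mon_count == 0
      @mon_owner = nil
      @mon_mutex.unlock
    end
  end

  def locked?
    @mon_mutex.locked?
  end
end

# Thread 1 enters and dies without mon_exit (as if timed out).
# Thread 2 then does a normal enter/exit pair and reports what it saw.
def replay(monitor)
  Thread.new { monitor.mon_enter }.join
  Thread.new do
    monitor.mon_enter
    count_seen = monitor.mon_count
    monitor.mon_exit
    [count_seen, monitor.locked?]
  end.value
end

stale = replay(ToyMonitor.new)
fixed = replay(ToyMonitor.new(reset_on_acquire: true))
```

Without the reset the second thread sees the count go from 0 to 2 and is still holding the mutex after its own `mon_exit`, so `stale` is `[2, true]`. With the reset, `fixed` is `[1, false]`. In a long-lived worker thread that leftover lock is what every other thread ends up waiting on.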
Demo Time!
https://github.com/lad/ruby_concurrency
• thread_deadlock.rb
• thread_starve.rb
Show Me The Code
Results From Different Ruby VMs
             MRI 1.8.7   MRI 1.9.3                     MRI 2.1.5   MRI (HEAD)
Deadlock     Yes         Yes                           Yes         No (*)
Starvation   Yes         Can’t say. Always deadlocks   Yes         Yes
(*) Mid Jan, 2016 (2.3+)
Results From Different Ruby VMs
             JRuby 1.6.8   JRuby 1.7.11 / JRuby 1.7.19 / JRuby 9.0.0.0.pre1
Deadlock     Yes           No Deadlock or Starvation, though only because
Starvation   Yes           the Timeout exception is not raised
Results From Different Ruby VMs
             RBX 2.4.1    RBX 2.5.2
Deadlock     VM Crashes   Yes, though mostly thread starvation
Starvation   VM Crashes   Yes
My Assertion
It is fundamentally unsafe to interrupt
a running thread in the general case
Side Effects
Lessons From
Other Languages
Thread Cancellation in C
• The C pthread API offers additional features
• Has the concept of thread cancellation:
• Enable / Disable thread cancellation requests
• User defined, per thread cleanup handlers
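Ruby has a loosely comparable facility in `Thread.handle_interrupt`, part of core Ruby since 2.0, which can defer asynchronous exceptions for the duration of a block. The sketch below is an illustration of that API only, not something shown in the talk. The method name and the two queues used for sequencing are hypothetical.

```ruby
# Hypothetical demo: an exception raised from outside is held back until
# the masked block has finished.
def deferred_interrupt_demo
  events = []
  ready = Queue.new
  go = Queue.new

  worker = Thread.new do
    begin
      Thread.handle_interrupt(RuntimeError => :never) do
        ready << true
        go.pop # blocks, but RuntimeError is masked inside this block
        events << :critical_section_finished
      end
    rescue RuntimeError
      events << :interrupted
    end
  end

  ready.pop                             # worker is inside the masked block
  worker.raise(RuntimeError, 'cancel')  # queued, not delivered yet
  go << true                            # let the critical section finish
  worker.join
  events
end

demo_events = deferred_interrupt_demo
```

The critical section completes first and the pending exception is delivered once the block ends. This only helps code that opts in, so it does not make interrupting arbitrary library code safe.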
Thread Cancellation in Java
Why is Thread.stop deprecated?
• Because it is inherently unsafe
• Stopping a thread causes it to unlock all the
monitors that it has locked
• If any of the objects previously protected by these
monitors were in an inconsistent state, other
threads may now view these objects in an
inconsistent state
http://docs.oracle.com/javase/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html
“As of JDK8, Thread.stop is really gone. It is the
first deprecated method to have actually been
de-implemented. It now just throws
UnsupportedOperationException”
Doug Lea, Java Concurrency Guru
http://cs.oswego.edu/pipermail/concurrency-interest/2013-December/012028.html
Java Thread Cancellation
Ruby rant: Timeout::Error
Jan 2008
(http://goo.gl/PLxR76)
Ruby's Thread#raise, Thread#kill, timeout.rb, and
net/protocol.rb libraries are broken
February 2008
(http://goo.gl/DI8GMX)
Ruby timeouts are dangerous
March 2013
(https://goo.gl/3EoTM6)
Ruby’s Most Dangerous API
May 2015
(http://goo.gl/2RkFbn)
Why Ruby’s Timeout is dangerous
(and Thread.raise is terrifying)
Nov 2015
(http://goo.gl/xLvuWG)
Reliable Ruby timeouts for M.R.I. 1.8
https://github.com/ph7/system-timer
Fixing Ruby's standard library Timeout
https://github.com/jjb/sane_timeout
A safer alternative to Ruby's Timeout that
uses unix processes instead of threads
https://github.com/david-mccullars/safe_timeout
Better timeout management for ruby
https://github.com/ryanking/deadline
What Does The Community Say?
Threading Quick Links
John Ousterhout:
http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf
Ed Lee:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
• Best practices, smart people, code locked up
after running successfully with minimal changes
for four years
Takeaways
• Try to avoid writing multi-threaded code
• Try to avoid reentrant mutexes
• Always use a mutex for read access to
shared state
• Don’t use the Timeout module. It’s a
broken concept
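One way to bound work without interrupting a thread from outside is a cooperative deadline: the code checks the clock itself at points where stopping is safe. This is a sketch of that general idea and not a recommendation from the talk. `DeadlineExceeded` and `with_deadline` are hypothetical names.

```ruby
# Hypothetical error raised by the running code itself, never injected.
class DeadlineExceeded < StandardError; end

# Yields a callable that raises once the deadline has passed.
def with_deadline(seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  check = lambda do
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise DeadlineExceeded, "exceeded #{seconds}s" if now > deadline
  end
  yield check
end

# Usage: the loop decides where it may be stopped.
outcome = begin
  with_deadline(0.05) do |check|
    loop do
      check.call # safe point: no lock is held here
      sleep 0.01 # stand-in for one unit of real work
    end
  end
rescue DeadlineExceeded
  :timed_out
end
```

Because the exception is raised by the thread itself at a known point, `ensure` blocks and lock bookkeeping run in the normal order. The cost is that a single unit of work that blocks forever is not interrupted.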
  • 50. Experiment 7 • Run the code again → Deadlock • All threads stuck in the same location as before: .../monitor.rb:185:in `lock' .../monitor.rb:185:in `mon_enter' .../monitor.rb:210:in `mon_synchronize' .../logger.rb:559:in `write'
  • 51. • So now I have a simple test case that reproduces the issue every time • I can also confirm that adding @mon_count = 0 into mon_enter "fixes" the problem Experiment 7
  • 52. Examine Again • At some point during all of this I showed this to a colleague who suggested I look for recent changes in this code within the Ruby repo • We checked the Ruby git repo...
  • 53. Examine Again commit 7be5169804ee0cfe1991903fa10c31f8bd6525bd Author: shugo <shugo@b2dd03c8-39d4-4d8f-98ff-823fe69b080e> Date: Mon May 18 04:56:22 2015 +0000 * lib/monitor.rb (mon_try_enter, mon_enter): should reset @mon_count just in case the previous owner thread dies without mon_exit. [fix GH-874] Patch by @chrisberkhout
  • 54. Root Cause Analysis Root Cause Analysis
  • 55. Root Cause Analysis • With a little more thought I realised what the root cause of this problem is… • It’s the Timeout module and how it corrupts state in the monitor object
  • 56. Root Cause Analysis Timeout.timeout(seconds) do logger.write end class Logger def write @mon.synchronize do write-log end end end class Monitor def synchronize mon_enter begin yield ensure mon_exit end end end
  • 57. Root Cause Analysis Thread 1 (T1) • Timeout.timeout(seconds) • Start a new thread T2 • logger.write • mon.synchronize – write-the-log • Kill T2 Thread 2 (T2) • Keeps a reference to T1 • sleep(seconds) • Raise a Timeout exception against T1
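The T1 / T2 flow on this slide can be sketched as a toy timeout helper. This is a simplified illustration that mirrors the slide, not the standard library's implementation; `sketch_timeout` and `SketchTimeoutError` are hypothetical names.

```ruby
# Hypothetical, simplified version of the flow on the slide.
class SketchTimeoutError < StandardError; end

def sketch_timeout(seconds)
  target = Thread.current                # T1: the thread running the block
  watcher = Thread.new do                # T2: keeps a reference to T1
    sleep(seconds)
    # Asynchronously raise in T1, wherever T1 happens to be executing.
    target.raise(SketchTimeoutError, "execution expired")
  end
  yield
ensure
  # T1 kills T2 once the block has finished or has been interrupted.
  if watcher
    watcher.kill
    watcher.join
  end
end
```

The key point is that `target.raise` can land on any line T1 is executing, including the lines between a lock being taken and the matching `ensure` being armed.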
  • 58. class Monitor def synchronize mon_enter begin yield ensure mon_exit end end end Root Cause Analysis • What if the Timeout exception arrives after mon_enter returns but before the begin block is entered? • mon_enter is invoked • mon_exit is not
  • 59. def mon_enter if @mon_owner != Thread.current @mon_mutex.lock @mon_owner = Thread.current end @mon_count += 1 end Root Cause Analysis
  • 60. Root Cause Analysis def mon_exit mon_check_owner @mon_count -=1 if @mon_count == 0 @mon_owner = nil @mon_mutex.unlock end end
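The effect of a missing `mon_exit` can be replayed deterministically with a cut-down monitor. The sketch below copies the `mon_enter` / `mon_exit` logic from the slides into a hypothetical `ToyMonitor` class and stands in for the interrupted thread with one that simply never calls `leave`. It assumes MRI's behaviour of releasing a dead thread's mutexes, which is what the warning seen in Experiment 5 refers to.

```ruby
# Hypothetical cut-down monitor using the logic shown on the slides.
class ToyMonitor
  attr_reader :count

  def initialize
    @mutex = Mutex.new
    @owner = nil
    @count = 0
  end

  def enter
    if @owner != Thread.current
      @mutex.lock
      @owner = Thread.current
    end
    @count += 1
  end

  def leave
    @count -= 1
    if @count == 0
      @owner = nil
      @mutex.unlock
    end
  end

  def locked?
    @mutex.locked?
  end
end

mon = ToyMonitor.new

# Thread A stands in for a thread interrupted after enter: it never calls leave.
Thread.new { mon.enter }.join
count_after_a = mon.count # stale count left behind by A

# Thread B performs a well-behaved enter / leave pair.
b = Thread.new do
  mon.enter
  inside = mon.count
  mon.leave
  [inside, mon.count, mon.locked?]
end
# Guard against hanging on a VM that does not release A's mutex.
observed = b.join(5) ? b.value : :blocked
```

On MRI I would expect `count_after_a` to be 1 and `observed` to be `[2, 1, true]`: thread B sees the count go straight to 2, its `leave` only brings it back to 1, and so the mutex is never unlocked while B is alive. Any other thread that then calls `enter` blocks, which matches the stacktraces stuck in `mon_enter`.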
  • 64. Results From Different Ruby VMs • MRI 1.8.7: Deadlock Yes, Starvation Yes • MRI 1.9.3: Deadlock Yes, Starvation can’t say (always deadlocks) • MRI 2.1.5: Deadlock Yes, Starvation Yes • MRI (HEAD): Deadlock No (*), Starvation Yes • (*) Mid Jan, 2016 (2.3+)
  • 65. Results From Different Ruby VMs • JRuby 1.6.8: Deadlock Yes, Starvation Yes • JRuby 1.7.11, JRuby 1.7.19, JRuby 9.0.0.0.pre1: No deadlock or starvation, though only because the Timeout exception is not raised
  • 66. Results From Different Ruby VMs • RBX 2.4.1: Deadlock VM crashes, Starvation VM crashes • RBX 2.5.2: Deadlock Yes (though mostly thread starvation), Starvation Yes
  • 67. My Assertion It is fundamentally unsafe to interrupt a running thread in the general case Side Effects
  • 69. Thread Cancellation in C • The C pthread API offers additional features • Has the concept of thread cancellation: • Enable / Disable thread cancellation requests • User defined, per thread cleanup handlers
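Ruby 2.0 and later has a loosely comparable facility in `Thread.handle_interrupt`, which can defer asynchronous exceptions for the duration of a block. The sketch below is an illustration added for comparison and is not taken from the talk; the queue names and the `sleep` standing in for a critical section are assumptions.

```ruby
events  = Queue.new
entered = Queue.new

worker = Thread.new do
  Thread.current.report_on_exception = false # keep the sketch's output quiet
  # Defer any asynchronously raised RuntimeError while inside this block.
  Thread.handle_interrupt(RuntimeError => :never) do
    entered << true   # tell the main thread the protected region has begun
    sleep 0.2         # stands in for a critical section
    events << :critical_section_finished
  end
  # Normally not reached: the deferred exception is delivered once the
  # protected block ends.
  events << :after_protected_region
end

entered.pop                                  # wait until the worker is protected
worker.raise(RuntimeError, "cancel requested")

outcome =
  begin
    worker.join
    :finished
  rescue RuntimeError
    :cancelled # join re-raises the exception that terminated the worker
  end
```

The cancellation request is still honoured, but only after the protected block has run to completion, which is roughly the role that disabling cancellation plays in the pthread API.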
  • 70. Thread Cancellation in Java Why is Thread.stop deprecated? • Because it is inherently unsafe • Stopping a thread causes it to unlock all the monitors that it has locked • If any of the objects previously protected by these monitors were in an inconsistent state, other threads may now view these objects in an inconsistent state http://docs.oracle.com/javase/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html
  • 71. “As of JDK8, Thread.stop is really gone. It is the first deprecated method to have actually been de-implemented. It now just throws UnsupportedOperationException” Doug Lea, Java Concurrency Guru http://cs.oswego.edu/pipermail/concurrency-interest/2013-December/012028.html Java Thread Cancellation
  • 72. What Does The Community Say? • Ruby rant: Timeout::Error, Jan 2008 (http://goo.gl/PLxR76) • Ruby's Thread#raise, Thread#kill, timeout.rb, and net/protocol.rb libraries are broken, February 2008 (http://goo.gl/DI8GMX) • Ruby timeouts are dangerous, March 2013 (https://goo.gl/3EoTM6) • Ruby’s Most Dangerous API, May 2015 (http://goo.gl/2RkFbn) • Why Ruby’s Timeout is dangerous (and Thread.raise is terrifying), Nov 2015 (http://goo.gl/xLvuWG) • Reliable Ruby timeouts for M.R.I. 1.8: https://github.com/ph7/system-timer • Fixing Ruby's standard library Timeout: https://github.com/jjb/sane_timeout • A safer alternative to Ruby's Timeout that uses unix processes instead of threads: https://github.com/david-mccullars/safe_timeout • Better timeout management for ruby: https://github.com/ryanking/deadline
  • 73. Threading Quick Links John Ousterhout: http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf Ed Lee: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf • Best practices, smart people, code locked up after running successfully with minimal changes for four years
  • 74. Takeaways • Try to avoid writing multi-threaded code • Try to avoid reentrant mutexes • Always use a mutex for read access to shared state • Don’t use the Timeout module. It’s a broken concept
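Where an operation offers its own timeout, using it avoids interrupting a thread at all. As one hedged illustration, the helper below waits on an IO object with `IO.select`; `gets_with_deadline` is a hypothetical name, and the pipe simply reuses the `IO.pipe` idea from the signal-handler slides.

```ruby
# Hypothetical helper: wait up to `seconds` for `io` to become readable.
def gets_with_deadline(io, seconds)
  ready = IO.select([io], nil, nil, seconds)
  return nil if ready.nil? # timed out; no thread was interrupted
  io.gets                  # may still block if only part of a line has arrived
end

reader, writer = IO.pipe
first = gets_with_deadline(reader, 0.1)  # nothing has been written yet
writer.puts("hello")
second = gets_with_deadline(reader, 0.1)
```

Here `first` is `nil` because the wait simply expires, and `second` is `"hello\n"`. No exception is injected into a running thread, so no lock can be left half-released.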