# In Search of the Perfect Global Interpreter Lock

Presentation on the Python/Ruby Global Interpreter Lock at RuPy 2011. October 14, 2011. Poznan, Poland.


1. In Search of the Perfect Global Interpreter Lock
   David Beazley
   http://www.dabeaz.com
   @dabeaz
   October 15, 2011
   Presented at RuPy 2011, Poznan, Poland
   Copyright (C) 2010, David Beazley, http://www.dabeaz.com
2. Introduction
   • As many programmers know, Python and Ruby feature a Global Interpreter Lock (GIL)
   • More precisely: CPython and MRI
   • It limits thread performance on multicore
   • Theoretically restricts code to a single CPU
3. An Experiment
   • Consider a trivial CPU-bound function

        def countdown(n):
            while n > 0:
                n -= 1

   • Run it once with a lot of work

        COUNT = 100000000    # 100 million
        countdown(COUNT)

   • Now, divide the work across two threads

        t1 = Thread(target=countdown, args=(COUNT//2,))
        t2 = Thread(target=countdown, args=(COUNT//2,))
        t1.start(); t2.start()
        t1.join(); t2.join()
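The Python experiment above can be run end to end as a small script. Here is a minimal, runnable version with timing added; the loop count is scaled down from the slide's 100 million so it finishes quickly, and absolute numbers will vary by machine and interpreter version:

```python
import time
from threading import Thread

def countdown(n):
    # Trivial CPU-bound loop from the slide
    while n > 0:
        n -= 1

COUNT = 5_000_000  # scaled down from the slide's 100 million

# Sequential run
start = time.perf_counter()
countdown(COUNT)
seq = time.perf_counter() - start

# Same total amount of work, split across two threads
t1 = Thread(target=countdown, args=(COUNT // 2,))
t2 = Thread(target=countdown, args=(COUNT // 2,))
start = time.perf_counter()
t1.start(); t2.start()
t1.join(); t2.join()
thr = time.perf_counter() - start

print(f"Sequential: {seq:.2f}s  Threaded: {thr:.2f}s")
```

On a multicore machine running CPython, the threaded run is typically no faster, and often slower, which is exactly the puzzle the following slides explore.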
4. An Experiment
   • Some Ruby

        def countdown(n)
          while n > 0
            n -= 1
          end
        end

   • Sequential

        COUNT = 100000000    # 100 million
        countdown(COUNT)

   • Subdivided across threads

        t1 = Thread.new { countdown(COUNT/2) }
        t2 = Thread.new { countdown(COUNT/2) }
        t1.join
        t2.join
5. Expectations
   • Sequential and threaded versions perform the same amount of work (same number of calculations)
   • There is the GIL... so no parallelism
   • Performance should be about the same
6. Results
   • Ruby 1.9 on OS X (4 cores)

        Sequential           : 2.46s
        Threaded (2 threads) : 2.55s   (~ same)
7. Results
   • Ruby 1.9 on OS X (4 cores)

        Sequential           : 2.46s
        Threaded (2 threads) : 2.55s   (~ same)

   • Python 2.7

        Sequential           : 6.12s
        Threaded (2 threads) : 9.28s   (1.5x slower!)
8. Results
   • Ruby 1.9 on OS X (4 cores)

        Sequential           : 2.46s
        Threaded (2 threads) : 2.55s   (~ same)

   • Python 2.7

        Sequential           : 6.12s
        Threaded (2 threads) : 9.28s   (1.5x slower!)

   • Question: Why does it get slower in Python?
9. Results
   • Ruby 1.9 on Windows Server 2008 (2 cores)

        Sequential           : 3.32s
        Threaded (2 threads) : 3.45s   (~ same)
10. Results
   • Ruby 1.9 on Windows Server 2008 (2 cores)

        Sequential           : 3.32s
        Threaded (2 threads) : 3.45s   (~ same)

   • Python 2.7

        Sequential           : 6.9s
        Threaded (2 threads) : 63.0s   (9.1x slower!)
11. Results
   • Ruby 1.9 on Windows Server 2008 (2 cores)

        Sequential           : 3.32s
        Threaded (2 threads) : 3.45s   (~ same)

   • Python 2.7

        Sequential           : 6.9s
        Threaded (2 threads) : 63.0s   (9.1x slower!)

   • Why does it get that much slower on Windows?
12. Experiment: Messaging
   • A request/reply server for size-prefixed messages
   [Diagram: Client <-> Server]
   • Each message: a size header + payload
   • Similar: ZeroMQ
13. An Experiment: Messaging
   • A simple test - message echo (pseudocode)

        def client(nummsg, msg):
            while nummsg > 0:
                send(msg)
                resp = recv()
                sleep(0.001)
                nummsg -= 1

        def server():
            while True:
                msg = recv()
                send(msg)
14. An Experiment: Messaging
   • A simple test - message echo (pseudocode)

        def client(nummsg, msg):
            while nummsg > 0:
                send(msg)
                resp = recv()
                sleep(0.001)
                nummsg -= 1

        def server():
            while True:
                msg = recv()
                send(msg)

   • To be less evil, it's throttled (<1000 msg/sec)
   • Not a messaging stress test
15. An Experiment: Messaging
   • A test: send/receive 1000 8K messages
   • Scenario 1: Unloaded server
   [Diagram: Client <-> Server]
   • Scenario 2: Server competing with one CPU-bound thread
   [Diagram: Client <-> Server, with a CPU-bound thread running alongside the server]
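The size-prefixed echo protocol described above can be sketched with real sockets. This is an illustrative reconstruction, not the benchmark code from the talk; the 4-byte big-endian length header and the helper names (`recv_exact`, `serve_one`) are assumptions:

```python
import socket
import struct
import threading

def recv_exact(sock, nbytes):
    # recv() may return fewer bytes than requested, so loop until complete
    data = b""
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            raise ConnectionError("socket closed")
        data += chunk
    return data

def serve_one(listener):
    # Accept a single client and echo size-prefixed messages back
    conn, _ = listener.accept()
    with conn:
        try:
            while True:
                header = recv_exact(conn, 4)
                size, = struct.unpack(">I", header)   # 4-byte size header
                payload = recv_exact(conn, size)
                conn.sendall(header)                  # echo: size, then payload
                conn.sendall(payload)
        except ConnectionError:
            pass

# Run the server in a background thread on a local port chosen by the OS
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()

# Client side: send one 8K message and read the echo
client = socket.socket()
client.connect(("127.0.0.1", port))
msg = b"x" * 8192
client.sendall(struct.pack(">I", len(msg)))
client.sendall(msg)
size, = struct.unpack(">I", recv_exact(client, 4))
reply = recv_exact(client, size)
client.close()
```

The benchmark's two scenarios then differ only in whether a CPU-bound thread is spinning in the server process while this exchange runs.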
16. Results
   • Messaging with no threads (OS X, 4 cores)

        C          : 1.26s
        Python 2.7 : 1.29s
        Ruby 1.9   : 1.29s
17. Results
   • Messaging with no threads (OS X, 4 cores)

        C          : 1.26s
        Python 2.7 : 1.29s
        Ruby 1.9   : 1.29s

   • Messaging with one CPU-bound thread*

        C          : 1.16s   (~8% faster!?)
        Python 2.7 : 12.3s   (10x slower)
        Ruby 1.9   : 42.0s   (33x slower)

   • Hmmm. Curious.
   * On Ruby, the CPU-bound thread was also given lower priority
18. Results
   • Messaging with no threads (Linux, 8 CPUs)

        C          : 1.13s
        Python 2.7 : 1.18s
        Ruby 1.9   : 1.18s
19. Results
   • Messaging with no threads (Linux, 8 CPUs)

        C          : 1.13s
        Python 2.7 : 1.18s
        Ruby 1.9   : 1.18s

   • Messaging with one CPU-bound thread

        C          : 1.11s     (same)
        Python 2.7 : 1.60s     (1.4x slower) - better
        Ruby 1.9   : 5839.4s   (~5000x slower) - worse!
20. Results
   • Messaging with no threads (Linux, 8 CPUs)

        C          : 1.13s
        Python 2.7 : 1.18s
        Ruby 1.9   : 1.18s

   • Messaging with one CPU-bound thread

        C          : 1.11s     (same)
        Python 2.7 : 1.60s     (1.4x slower) - better
        Ruby 1.9   : 5839.4s   (~5000x slower) - worse!

   • 5000x slower? Really? Why?
21. The Mystery Deepens
   • Disable all but one CPU core
   • CPU-bound threads (OS X)

        Python 2.7 (4 cores + hyperthreading) : 9.28s
        Python 2.7 (1 core)                   : 7.9s   (faster!)

   • Messaging with one CPU-bound thread

        Ruby 1.9 (4 cores + hyperthreading) : 42.0s
        Ruby 1.9 (1 core)                   : 10.5s   (much faster!)

   • ?!?!?!?!?!?
22. Better is Worse
   • Change software versions
   • Let's upgrade to Python 3 (Linux)

        Python 2.7 (Messaging) : 12.3s
        Python 3.2 (Messaging) : 20.1s   (1.6x slower)

   • Let's downgrade to Ruby 1.8 (Linux)

        Ruby 1.9 (Messaging)   : 42.0s
        Ruby 1.8.7 (Messaging) : 10.0s   (4x faster)

   • So much for progress (sigh)
23. What's Happening?
   • The GIL does far more than limit cores
   • It can make performance much worse
   • Better performance by turning off cores?
   • 5000x performance hit on Linux?
   • Why?
24. Why You Might Care
   • Must you abandon Python/Ruby for concurrency?
   • Having threads restricted to one CPU core might be okay if it were sane
   • Analogy: A multitasking operating system (e.g., Linux) runs fine on a single CPU
   • Plus, threads get used a lot behind the scenes (even in thread alternatives, e.g., async)
25. Why I Care
   • It's an interesting little systems problem
   • How do you make a better GIL?
   • It's fun.
26. Some Background
   • I have been discussing some of these issues in the Python community since 2009: http://www.dabeaz.com/GIL
   • I'm less familiar with Ruby, but I've looked at its GIL implementation and experimented
   • Very interested in commonalities/differences
27. A Tale of Two GILs
28. Thread Implementation
   Python:
   • System threads (e.g., pthreads)
   • Managed by OS
   • Concurrent execution of the Python interpreter (written in C)
   Ruby:
   • System threads (e.g., pthreads)
   • Managed by OS
   • Concurrent execution of the Ruby VM (written in C)
29. Alas, the GIL
   • Parallel execution is forbidden
   • There is a "global interpreter lock"
   • The GIL ensures that only one thread runs in the interpreter at once
   • Simplifies many low-level details (memory management, callouts to C extensions, etc.)
30. GIL Implementation
   • Simple mutex lock:

        mutex_t gil;

        void gil_acquire() {
            mutex_lock(gil);
        }
        void gil_release() {
            mutex_unlock(gil);
        }

   • Condition variable:

        int gil_locked = 0;
        mutex_t gil_mutex;
        cond_t gil_cond;

        void gil_acquire() {
            mutex_lock(gil_mutex);
            while (gil_locked)
                cond_wait(gil_cond);
            gil_locked = 1;
            mutex_unlock(gil_mutex);
        }
        void gil_release() {
            mutex_lock(gil_mutex);
            gil_locked = 0;
            cond_notify();
            mutex_unlock(gil_mutex);
        }
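The condition-variable version translates almost line for line into Python's `threading` module. This is a toy model for illustration only, not CPython's or MRI's actual GIL code:

```python
import threading

class GIL:
    # Toy model of the condition-variable GIL sketched on the slide
    def __init__(self):
        self._locked = False                 # gil_locked
        self._cond = threading.Condition()   # bundles gil_mutex and gil_cond

    def acquire(self):
        with self._cond:                     # mutex_lock(gil_mutex)
            while self._locked:              # while (gil_locked)
                self._cond.wait()            #     cond_wait(gil_cond)
            self._locked = True              # gil_locked = 1
            # mutex_unlock(gil_mutex) happens when the with-block exits

    def release(self):
        with self._cond:
            self._locked = False             # gil_locked = 0
            self._cond.notify()              # cond_notify()
```

Two threads can safely share state as long as every access happens between `acquire()` and `release()`, just as VM threads do under the real GIL.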
31. Thread Execution Model
   • The GIL results in cooperative multitasking
   [Diagram: Threads 1-3 alternate; each runs until it blocks, releasing the GIL, and another thread acquires the GIL and runs]
   • When a thread is running, it holds the GIL
   • GIL released on blocking (e.g., I/O operations)
32. Threads for I/O
   • For I/O it works great
   • GIL is never held very long
   • Most threads just sit around sleeping
   • Life is good
33. Threads for Computation
   • You may actually want to compute something!
     • Fibonacci numbers
     • Image/audio processing
     • Parsing
   • The CPU will be busy
   • And it won't give up the GIL on its own
34. CPU-Bound Switching
   Python:
   • Releases and reacquires the GIL every 100 "ticks"
   • 1 tick ~= 1 interpreter instruction
   Ruby:
   • Background thread generates a timer interrupt every 10ms
   • GIL released and reacquired by current thread on interrupt
35. Python Thread Switching
   [Diagram: a CPU-bound thread runs 100 ticks, releases and reacquires the GIL, runs another 100 ticks, and so on]
   • Every 100 VM instructions, the GIL is dropped, allowing other threads to run if they want
   • Not time based: the switching interval depends on the kind of instructions executed
36. Ruby Thread Switching
   [Diagram: a timer thread fires every 10ms; the CPU-bound thread releases and reacquires the GIL at each timer tick, then keeps running]
   • Loosely mimics the time-slice of the OS
   • Every 10ms, the GIL is released/acquired
37. A Common Theme
   • Both Python and Ruby have C code like this:

        void execute() {
            while (inst = next_instruction()) {
                // Run the VM instruction
                ...
                if (must_release_gil) {
                    GIL_release();
                    /* Other threads may run now */
                    GIL_acquire();
                }
            }
        }

   • Exact details vary, but the concept is the same
   • Each thread has a periodic release/acquire in the VM to allow other threads to run
38. Question
   • What can go wrong with this bit of code?

        if (must_release_gil) {
            GIL_release();
            /* Other threads may run now */
            GIL_acquire();
        }

   • Short answer: Everything!
39. Pathology
43. Thread Switching
   • You might expect that Thread 2 will run
   [Diagram: Thread 1 runs until preempted and releases the GIL; pthreads/the OS schedules Thread 2, which acquires the GIL and runs]
   • But you assume the GIL plays nice...
44. Thread Switching
   • What might actually happen on multicore
   [Diagram: Thread 1 is preempted, releases the GIL, and immediately reacquires it; pthreads/the OS schedules Thread 2, but its acquire fails (GIL locked) and it stays READY]
   • Both threads attempt to run simultaneously
   • ... but only one will succeed (depends on timing)
45. Fallacy
   • This code doesn't actually switch threads

        if (must_release_gil) {
            GIL_release();
            /* Other threads may run now */
            GIL_acquire();
        }

   • It might switch threads, but it depends
     • What operating system
     • # of cores
     • Lock scheduling policy (if any)
46. Fallacy
   • This doesn't force switching (sleeping)

        if (must_release_gil) {
            GIL_release();
            sleep(0);
            /* Other threads may run now */
            GIL_acquire();
        }

   • It might switch threads, but it depends
     • What operating system
     • # of cores
     • Lock scheduling policy (if any)
47. Fallacy
   • Neither does this (calling the scheduler)

        if (must_release_gil) {
            GIL_release();
            sched_yield();
            /* Other threads may run now */
            GIL_acquire();
        }

   • It might switch threads, but it depends
     • What operating system
     • # of cores
     • Lock scheduling policy (if any)
48. A Conflict
   • There are conflicting goals
     • Python/Ruby: wants to run on a single CPU, but doesn't want to do thread scheduling (i.e., let the OS do it)
     • OS: "Oooh. Multiple cores." Schedules as many runnable tasks as possible at any instant
   • Result: Threads fight with each other
49. Multicore GIL Battle
   • Python 2.7 on OS X (4 cores)

        Sequential           : 6.12s
        Threaded (2 threads) : 9.28s   (1.5x slower!)

   [Diagram: Thread 1 runs 100 ticks, releases and reacquires the GIL, and is preempted; pthreads/the OS repeatedly schedules Thread 2, whose GIL acquisitions fail until, eventually, one succeeds and it runs]
   • Millions of failed GIL acquisitions
50. Multicore GIL Battle
   • You can see it! (2 CPU-bound threads)
   [Screenshot: CPU monitor with multiple cores busy; the slide asks "Why >100%?"]
   • Comment: In Python, it's very rapid
   • The GIL is released every few microseconds!
51. I/O Handling
   • If there is a CPU-bound thread, I/O-bound threads have a hard time getting the GIL
   [Diagram: Thread 1 (CPU 1) keeps running; Thread 2 (CPU 2) sleeps until a network packet arrives, then its "Acquire GIL" attempts fail on each preemption (this might repeat 100s-1000s of times) before one finally succeeds and it runs]
52. Messaging Pathology
   • Messaging on Linux (8 cores)

        Ruby 1.9 (no threads)   : 1.18s
        Ruby 1.9 (1 CPU thread) : 5839.4s

   • Locks in Linux have no fairness
   • Consequence: Really hard to steal the GIL
   • And Ruby only retries every 10ms
53. Let's Talk Fairness
   • Fair locking means that locks have some notion of priorities, arrival order, queuing, etc.
   [Diagram: t0 runs holding the lock while t1-t5 wait in order; after t0 releases, t1 holds the lock and t0 joins the end of the waiting line]
   • Releasing means you go to the end of the line
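The fair-locking behavior sketched above (waiters served in arrival order, a releasing thread rejoining the back of the line) can be modeled in a few lines of Python. This is an illustrative FIFO lock, not any interpreter's actual implementation:

```python
import threading
from collections import deque

class FairLock:
    # Grants the lock strictly in arrival order
    def __init__(self):
        self._cond = threading.Condition()
        self._queue = deque()    # threads waiting, oldest first
        self._owner = None

    def acquire(self):
        me = threading.current_thread()
        with self._cond:
            self._queue.append(me)
            # Proceed only when the lock is free AND we are at the head of the line
            while self._owner is not None or self._queue[0] is not me:
                self._cond.wait()
            self._queue.popleft()
            self._owner = me

    def release(self):
        with self._cond:
            self._owner = None
            self._cond.notify_all()   # wake waiters; only the new head proceeds
```

A thread that releases and immediately re-acquires goes to the back of the queue, which is why a fair GIL forces a full rotation through every waiter.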
54. Effect of Fair-Locking
   • Ruby 1.9 (multiple cores)

        Messages + 1 CPU Thread (OS X)  : 42.0s
        Messages + 1 CPU Thread (Linux) : 5839.4s

   • Question: Which one uses fair locking?
55. Effect of Fair-Locking
   • Ruby 1.9 (multiple cores)

        Messages + 1 CPU Thread (OS X)  : 42.0s   (Fair)
        Messages + 1 CPU Thread (Linux) : 5839.4s

   • Benefit: I/O threads get their turn (yay!)
56. Effect of Fair-Locking
   • Ruby 1.9 (multiple cores)

        Messages + 1 CPU Thread (OS X)  : 42.0s   (Fair)
        Messages + 1 CPU Thread (Linux) : 5839.4s

   • Benefit: I/O threads get their turn (yay!)
   • Python 2.7 (multiple cores)

        2 CPU-Bound Threads (OS X)    : 9.28s
        2 CPU-Bound Threads (Windows) : 63.0s

   • Question: Which one uses fair locking?
57. Effect of Fair-Locking
   • Ruby 1.9 (multiple cores)

        Messages + 1 CPU Thread (OS X)  : 42.0s   (Fair)
        Messages + 1 CPU Thread (Linux) : 5839.4s

   • Benefit: I/O threads get their turn (yay!)
   • Python 2.7 (multiple cores)

        2 CPU-Bound Threads (OS X)    : 9.28s
        2 CPU-Bound Threads (Windows) : 63.0s   (Fair)

   • Problem: Too much context switching
58. Fair-Locking - Bah!
   • In reality, you don't want fairness
   • Messaging revisited (OS X, 4 cores)

        Ruby 1.9 (no threads)         : 1.29s
        Ruby 1.9 (1 CPU-bound thread) : 42.0s   (33x slower)

   • Why is it still 33x slower?
   • Answer: Fair locking! (and convoying)
59. Messaging Revisited
   • Go back to the messaging server

        def server():
            while True:
                msg = recv()
                send(msg)
60. Messaging Revisited
   • The actual implementation (size-prefixed messages)

        def server():
            while True:
                size = recv(4)
                msg = recv(size)
                send(size)
                send(msg)
61. Performance Explained
   • What actually happens under the covers

        def server():
            while True:
                size = recv(4)    # GIL released
                msg = recv(size)  # GIL released
                send(size)        # GIL released
                send(msg)         # GIL released

   • Why? Each operation might block
   • Catch: Passes control back to the CPU-bound thread
62. Performance Illustrated
   [Diagram: a 10ms timer paces the CPU-bound thread; after the data arrives, each of the I/O thread's four operations (recv, recv, send, send) waits for the next 10ms slice]
   • Each message has a 40ms response cycle
   • 1000 messages x 40ms = 40s (42.0s measured)
63. Despair
64. A Solution? Don't use threads!
   • Yes, yes, everyone hates threads
   • However, that's only because they're useful!
   • Threads are used for all sorts of things
   • Even if they're hidden behind the scenes
65. A Better Solution: Make the GIL better
   • It's probably not going away (very difficult)
   • However, does it have to thrash wildly?
   • Question: Can you do anything?
66. GIL Efforts in Python 3
   • Python 3.2 has a new GIL implementation
   • It's imperfect; in fact, it has a lot of problems
   • However, people are experimenting with it
67. Python 3 GIL
   • GIL acquisition now based on timeouts
   [Diagram: Thread 1 runs; Thread 2, leaving IOWAIT when data arrives, does wait(gil, TIMEOUT); after a 5ms timeout it sets drop_request, and Thread 1 releases so Thread 2 can run]
   • Involves waiting on a condition variable
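The timeout-based handoff can be sketched as follows. This is a simplified Python model of the mechanism, not CPython's actual C implementation (which also wires the drop request into the eval loop's periodic checks); the 5ms interval is the one from the slide:

```python
import threading

SWITCH_INTERVAL = 0.005  # 5ms, as on the slide

class NewGIL:
    # Simplified model of the Python 3.2 timeout-based GIL handoff
    def __init__(self):
        self._cond = threading.Condition()
        self._locked = False
        self.drop_request = False   # the running thread is expected to poll this

    def acquire(self):
        with self._cond:
            while self._locked:
                # Wait up to 5ms; on timeout, ask the holder to give it up
                if not self._cond.wait(timeout=SWITCH_INTERVAL):
                    self.drop_request = True
            self._locked = True
            self.drop_request = False

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify()
```

In the real interpreter the CPU-bound thread notices `drop_request` in its execution loop and releases voluntarily; here a caller must invoke `release()` itself.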
68. Problem: Convoying
   • CPU-bound threads significantly degrade I/O
   [Diagram: Thread 1 runs in 5ms slices, releasing only at timeouts; Thread 2 goes READY each time data arrives but only runs after each 5ms wait]
   • This is the same problem as in Ruby
   • Just a shorter time delay (5ms)
69. Problem: Convoying
   • You can directly observe the delays (messaging)

        Python/Ruby (no threads) : 1.29s   (no delays)
        Python 3.2 (1 Thread)    : 20.1s   (5ms delays)
        Ruby 1.9 (1 Thread)      : 42.0s   (10ms delays)

   • Still not great, but the problem is understood
70. Promise
71. Priorities
   • Best promise: Priority scheduling
   • Earlier versions of Ruby had it
   • It works (OS X, 4 cores)

        Ruby 1.9 (1 Thread)                   : 42.0s
        Ruby 1.8.7 (1 Thread)                 : 40.2s
        Ruby 1.8.7 (1 Thread, lower priority) : 10.0s

   • Comment: Ruby 1.9 allows thread priorities to be set in pthreads, but it doesn't seem to have much (if any) effect
72. Priorities
   • Experimental Python 3.2 with a priority scheduler
   • Also features immediate preemption
   • Messages (OS X, 4 cores)

        Python 3.2 (no threads)              : 1.29s
        Python 3.2 (1 Thread)                : 20.2s
        Python 3.2 + priorities (1 Thread)   : 1.21s   (faster?)

   • That's a lot more promising!
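One way a priority-aware GIL could work is a lock whose waiters are granted in priority order, so an I/O-bound thread can jump the queue ahead of a CPU-bound one. This is a speculative sketch of the idea, not the experimental patch itself; the priority numbering convention is an assumption:

```python
import threading
import heapq
import itertools

class PriorityGIL:
    # Grants the lock to the highest-priority waiter (lowest number wins)
    def __init__(self):
        self._cond = threading.Condition()
        self._waiting = []                # heap of (priority, seq, thread)
        self._seq = itertools.count()     # FIFO tie-breaker within a priority
        self._locked = False

    def acquire(self, priority=0):
        # e.g., I/O threads might pass priority=0, CPU-bound threads priority=1
        me = threading.current_thread()
        with self._cond:
            heapq.heappush(self._waiting, (priority, next(self._seq), me))
            while self._locked or self._waiting[0][2] is not me:
                self._cond.wait()
            heapq.heappop(self._waiting)
            self._locked = True

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify_all()       # only the top-priority waiter proceeds
```

As the next slide notes, favoring some threads this way immediately raises starvation and priority-inversion questions.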
73. New Problems
   • Priorities bring new challenges
     • Starvation
     • Priority inversion
     • Implementation complexity
   • Do you have to write a full OS scheduler?
   • Hopefully not, but it's an open question
74. Final Words
   • Implementing a GIL is a lot trickier than it looks
   • Even work with priorities has problems
   • Good example of how multicore is diabolical
75. Thanks for Listening!
   • I hope you learned at least one new thing
   • I'm always interested in feedback
   • Follow me on Twitter (@dabeaz)