Multithreading and Parallelism on iOS [MobOS 2013]

Published in: Technology, Education

  1. 1. Multithreading and Parallelism on iOS • Kuba Brecka, @kubabrecka • Mobile Operating Systems Conference MobOS 2013
  2. 2. Agenda • Part I: Parallelism and multithreading overview • Part II: Thread-safety, GCD, operation queues • Part III: Synchronization, locking, memory model • Part IV: Performance tuning, ILP • Part V: (at the party) Whatever you’d like to discuss
  3. 3. Multithreading and Parallelism on iOS Part I: Parallelism and multithreading overview
  4. 4. Quiz 1
         int a;

         - (void)method {
             a = 0;
             dispatch_queue_t queue = dispatch_get_global_queue(
                 DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
             dispatch_async(queue, ^{ a = 1; });
             dispatch_async(queue, ^{ a = 2; });
             NSLog(@"%d", a);
         }
  5. 5. Quiz 2
         int a;

         - (void)method {
             a = 0;
             dispatch_queue_t queue = dispatch_get_global_queue(
                 DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

             dispatch_async(queue, ^{ a = 1; });
             while (a == 0) {
                 // wait
             }

             NSLog(@"%d", a);
         }
  6. 6. Parallelism is a huge topic
  7. 7. Terminology • Parallel • Multi-threaded • Concurrent • Simultaneous • Asynchronous
  8. 8. Why parallelize? • Responsiveness • “when I scroll, it’s smooth” • Performance • “it works fast” • Energy saving • “it doesn’t drain my battery” • Convenience • some things are parallel by nature, e.g. running two completely separate apps
  9. 9. How? • Multiple processes • XPC, fork • Multiple threads • POSIX Threads, NSThread • High-level thread abstraction • Operation queues, dispatch queues • GPGPU • Instruction-level parallelism • superscalar CPUs, pipelining, vector instructions • Multiple PCs • servers, clouds
  10. 10. Threads • What is a thread? • It’s an abstraction made by the OS • The CPU has no such concept • Represents a line of calculation • Has an ID, a stack, thread-local storage, priority, CPU registers • Shares memory and resources within a process • The OS scheduler runs/pauses threads • context switching
  11. 11. Issues with threading • Race conditions • the result depends on the timing of the scheduler • the behavior is non-deterministic • can result in almost anything • crash, wrong result, corrupted data • So, you have to use locks/mutexes/… • More issues: deadlocks, livelocks, starvation • Even the best guys have trouble with these • Security consequences, vulnerabilities
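A minimal sketch of the "use locks/mutexes" point, written in plain C with POSIX threads rather than taken from the talk. The names `counter`, `increment_worker` and `run_locked_counter` are made up for illustration. Without the mutex, the `counter++` read-modify-write from several threads would be a race and the final value would depend on scheduler timing.

```c
#include <assert.h>
#include <pthread.h>

#define THREADS 4
#define ITERATIONS 100000

static long counter = 0;   /* shared mutable state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter; the mutex turns the
   read-modify-write into a critical section. */
static void *increment_worker(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs THREADS workers to completion and returns the final count. */
long run_locked_counter(void) {
    pthread_t threads[THREADS];
    counter = 0;
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, increment_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return counter;   /* safe to read: all workers have been joined */
}
```

With the lock in place the result is always THREADS * ITERATIONS; removing the lock/unlock pair typically makes it come out lower, and differently on each run.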
  12. 12. Know your enemy • The compiler • The CPU • The memory • Time • Your brain
  13. 13. The iPhone has matured • iPhone 4: 512 MB RAM, A4 SoC (1 core), 800 MHz • iPhone 4S: 512 MB RAM, A5 SoC (2 cores), 800 MHz • iPhone 5: 1 GB RAM, A6 SoC (2 cores), 1300 MHz • iPhone 5S: 1 GB RAM, A7 SoC (2 cores), 1300 MHz
  14. 14. ARM has matured • Apple A5 (2011) • ARM Cortex-A9 MPCore • 2 cores • out-of-order execution • speculative execution • superscalar, pipelining (8 stages) • NEON 128-bit SIMD • Apple A7 (2013) • ARMv8-A “Cyclone” • 64-bit, 32 registers, per-core L1 cache
  15. 15. iOS has matured • The kernel knows a lot more about the system than the developer • GCD • Operation Queues • LLVM, compiler optimizations • GPU computations • Accelerate.framework
  16. 16. iOS threading technologies • Multiple processes – forking disabled, no XPC • Low-level threads • POSIX Threads (pthread) • NSThread • -[NSObject performSelectorInBackground:withObject:] • Higher-level abstractions • NSOperationQueue, NSOperation • GCD
  17. 17. Is multithreading hard? • Yes, if you don’t know what you’re doing. • But that’s true for anything. • Paul E. McKenney: Is Parallel Programming Hard, And, If So, What Can You Do About It? (2013) • https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
  18. 18. You need to know how it works • The abstractions you use (threads, dispatch queues) are leaky • You still must know how it works below: • CPU • OS • compiler (LLVM) • libraries and 3rd party code you are using • language specification • language implementation • + the abstraction you are using (GCD)
  19. 19. You need to know even more • Often you parallelize to get better performance • For this you need to know • CPU architecture details • CPU instruction latencies • memory hierarchy and latencies
  20. 20. Parallelizing tasks vs. algorithms • Task = a standalone unit of work • has some inputs, gives some outputs • “add a blur effect to these 1000 photos” • 1 photo = 1 task (independent) • “add a blur effect to this one 5000x5000px photo” • 1 task = ? • Some algorithms simply cannot be parallelized (you will not get any significant speedup)
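One possible answer to "1 task = ?" for a single large image is to cut it into horizontal bands and treat each band as a task. The sketch below assumes a simple per-pixel brighten instead of a blur (a blur would also need to read neighbouring rows from an unmodified copy), and uses plain POSIX threads. `band_task_t`, `brighten_band` and `brighten_parallel` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define BANDS 4

typedef struct {
    uint8_t *pixels;   /* first pixel of this band */
    size_t   count;    /* number of pixels in this band */
    uint8_t  delta;    /* brightness to add */
} band_task_t;

/* One task: brighten one band. Bands do not overlap, so no locking is needed. */
static void *brighten_band(void *arg) {
    band_task_t *task = arg;
    for (size_t i = 0; i < task->count; i++) {
        unsigned sum = (unsigned)task->pixels[i] + task->delta;
        task->pixels[i] = (uint8_t)(sum > 255 ? 255 : sum);   /* saturate */
    }
    return NULL;
}

/* Splits an 8-bit grayscale image into BANDS row bands, one thread each. */
void brighten_parallel(uint8_t *pixels, size_t width, size_t height, uint8_t delta) {
    pthread_t threads[BANDS];
    band_task_t tasks[BANDS];
    size_t rows_per_band = (height + BANDS - 1) / BANDS;

    for (size_t b = 0; b < BANDS; b++) {
        size_t first_row = b * rows_per_band;
        if (first_row > height) first_row = height;
        size_t last_row = first_row + rows_per_band;
        if (last_row > height) last_row = height;

        tasks[b].pixels = pixels + first_row * width;
        tasks[b].count = (last_row - first_row) * width;   /* may be 0 */
        tasks[b].delta = delta;
        int rc = pthread_create(&threads[b], NULL, brighten_band, &tasks[b]);
        assert(rc == 0);
    }
    for (size_t b = 0; b < BANDS; b++) {
        pthread_join(threads[b], NULL);
    }
}
```

Whether four bands beat one thread depends on the image size and the thread start-up cost, so this is something to measure rather than assume.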
  21. 21. Multithreading and Parallelism on iOS Part II: Thread-safety, GCD, operation queues
  22. 22. What’s thread safety? • “Thread-safe object” • you can safely use the object from multiple threads at the same time • the internal state of the object will not get corrupted and it will behave correctly • When you don’t know if an object is thread-safe, you have to assume it isn’t • How do you make your object thread-safe? • immutability, locks, atomic reads/writes
  23. 23. Shared mutable state • Exclusive immutable object = no problem • Shared immutable object = no problem • Exclusive mutable object = no problem • Shared mutable object • root of all evil • you always want to minimize this
  24. 24. Global variables • “Global variables are bad” • Multi-threading is another very good reason not to use global variables / global state • Global variables are always shared • Watch out for “hidden” global state: • working directory, chdir() • environment variables, putenv()
  25. 25. Thread-safety vs. iOS • Terrible lack of proper documentation • Most of the low-level Obj-C runtime is thread-safe • memory management, ARC, weak references, … • Immutable objects (NSString, NSArray, …) are thread-safe • A few other classes are thread-safe • Usually it’s thread-safe to call class methods • Google for “iOS thread safety” • https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html
  26. 26. POSIX threads • “plain threads” • C API • if you want to pass an object to the new thread, you will have issues with memory management • Synchronization • mutexes, conditions, R/W locks, barriers
  27. 27. POSIX thread API • pthread_create • pthread_join • mutex • pthread_mutex_init, pthread_mutex_lock, pthread_mutex_unlock • conditions • pthread_cond_init, pthread_cond_signal, pthread_cond_wait
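The calls listed above fit together roughly as in the following sketch: one thread produces a value, the other waits for it on a condition variable instead of spinning in a loop. `mailbox_t`, `producer` and `receive_value` are illustrative names, not part of any API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cond;
    bool            ready;   /* the condition the waiter checks */
    int             value;
} mailbox_t;

/* Producer thread: store the value and signal under the lock. */
static void *producer(void *arg) {
    mailbox_t *box = arg;
    pthread_mutex_lock(&box->lock);
    box->value = 42;
    box->ready = true;
    pthread_cond_signal(&box->ready_cond);
    pthread_mutex_unlock(&box->lock);
    return NULL;
}

/* Starts the producer, blocks until the value is ready, returns it. */
int receive_value(void) {
    mailbox_t box;
    box.ready = false;
    box.value = 0;
    pthread_mutex_init(&box.lock, NULL);
    pthread_cond_init(&box.ready_cond, NULL);

    pthread_t thread;
    int rc = pthread_create(&thread, NULL, producer, &box);
    assert(rc == 0);

    pthread_mutex_lock(&box.lock);
    while (!box.ready) {   /* loop guards against spurious wakeups */
        pthread_cond_wait(&box.ready_cond, &box.lock);
    }
    int received = box.value;
    pthread_mutex_unlock(&box.lock);

    pthread_join(thread, NULL);
    pthread_cond_destroy(&box.ready_cond);
    pthread_mutex_destroy(&box.lock);
    return received;
}
```

`pthread_cond_wait` releases the mutex while sleeping and re-acquires it before returning, which is why the flag is always checked and changed with the lock held.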
  28. 28. NSThread • “plain threads” as well • Obj-C API • mostly just a wrapper around POSIX threads • memory management just works • Synchronize with NSLock, NSCondition, …
  29. 29. NSThread API • -[NSThread initWithTarget:selector:object:] • -[NSThread start] • +[NSThread detachNewThreadSelector:toTarget:withObject:] • subclassing NSThread • -[NSObject performSelectorInBackground:withObject:]
  30. 30. Thread-specific properties • Thread-local storage • Thread priorities • Autorelease pools • Detached vs. joinable
  31. 31. Grand Central Dispatch • Let’s not think about threads • Instead, let’s think about tasks • New concepts: • Tasks • Queues • Queue-specific data • Dispatch groups • Dispatch sources • Synchronization • Semaphores, barriers • C API (!) but has ARC and works with blocks
  32. 32. GCD queues • Main queue • there is just one, executed on the main thread • Concurrent queue • tasks run concurrently • 4 pre-made concurrent queues with different priorities • DISPATCH_QUEUE_PRIORITY_DEFAULT, _HIGH, _LOW, _BACKGROUND • you can make your own • Serial queue • only one task at a time, in order • you can make your own
  33. 33. GCD task API • Get/create a queue: • dispatch_get_global_queue • dispatch_get_main_queue • dispatch_queue_create • Submit task: • dispatch_sync • dispatch_async • dispatch_apply
  34. 34. GCD convenience API • dispatch_once • guarantees the code runs only once • use to implement a proper and fast singleton • dispatch_after • execute the task at a specific time
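`dispatch_once` itself needs Apple's libdispatch and blocks. As a portable stand-in, the sketch below shows the same run-exactly-once guarantee with POSIX `pthread_once`. This is a different API from the one on the slide, and `shared_config`, `get_shared_config` and the other names are made up.

```c
#include <assert.h>
#include <pthread.h>

#define READERS 8

static pthread_once_t config_once = PTHREAD_ONCE_INIT;
static int init_calls = 0;      /* how many times the initializer ran */
static int shared_config = 0;   /* stand-in for lazily created singleton state */

static void init_shared_config(void) {
    init_calls++;
    shared_config = 1234;
}

/* Safe to call from any thread; the initializer runs exactly once. */
int get_shared_config(void) {
    pthread_once(&config_once, init_shared_config);
    return shared_config;
}

static void *reader(void *unused) {
    (void)unused;
    get_shared_config();
    return NULL;
}

/* Lets several threads race for the first access, then reports
   how many times the initializer actually ran. */
int init_calls_after_racing_readers(void) {
    pthread_t threads[READERS];
    for (int i = 0; i < READERS; i++) {
        int rc = pthread_create(&threads[i], NULL, reader, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < READERS; i++) {
        pthread_join(threads[i], NULL);
    }
    return init_calls;
}
```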
  35. 35. It’s not threads • GCD uses threads, but the threads are completely managed by GCD • You can’t assume your code will run on any specific thread • even two tasks from the same serial queue can run on different threads • Don’t use thread-local storage • Don’t use thread priorities
  36. 36. Operation queues • A similar abstraction to GCD, this time you have: • NSOperation • either a block, a method call or custom subclass • concurrent or non-concurrent • dependencies on other NSOperations • support for cancellation • NSOperationQueue • executes the operations, or you can execute an operation directly
  37. 37. Operation queues API • -[NSOperationQueue addOperation:] • -[NSOperationQueue addOperationWithBlock:] • -[NSOperation addDependency:] • +[NSBlockOperation blockOperationWithBlock:] • -[NSInvocationOperation initWithTarget:selector:object:]
  38. 38. Comparison • POSIX threads, NSThread • thread-based • you have control over the lifetime of threads • overhead when creating • memory-management issues • GCD, operation queues • task-based • nice API with objects/blocks • operation queues • dependencies
  39. 39. Run loops and messaging • Avoid shared mutable state • For POSIX threads and NSThreads: • put your thread into an event loop, where it just waits until an event occurs • the main thread has this by default • hidden inside UIApplicationMain • then you can communicate with the thread through: • -[NSObject performSelector:onThread:withObject:waitUntilDone:]
  40. 40. Run loop API • +[NSRunLoop currentRunLoop] • -[NSRunLoop run] • you have to add at least one input source or it will return immediately • but you can add an empty port • [NSMachPort port] • -[NSRunLoop addPort:forMode:]
  41. 41. Main thread • first thread = main thread = UI thread • all rendering • all layout • scrolling, panning, zooming • user input (touches, on-screen keyboard, external keyboard) • system events • Yes, that’s a lot of work. • 60 FPS = 16 ms per frame • Yes, that’s very little time.
  42. 42. Offload the main thread • Goal: Keep the UI thread responsive • Rule: • Do as much work as possible on other threads • Well, but… • Do as little work as possible in the background, that is just enough to keep the main thread responsive • Measure, measure, measure
  43. 43. Rendering and animations • Your app doesn’t have access to the GPU/display • Background process called “backboardd” • IPC – rendering commands • Shared memory – backing stores • CAAnimations are transferred to backboardd and performed without any communication with your app
  44. 44. Demo 1 https://github.com/kubabrecka/mobos-ios
  45. 45. Multithreading and Parallelism on iOS Part III: Synchronization, locking, memory model
  46. 46. Demo 2 https://github.com/kubabrecka/mobos-ios
  47. 47. Only trust what’s guaranteed • The order of things isn’t guaranteed unless someone tells you:
          int a, b; // global variables

          // thread 1
          b = 20;
          a = 10;

          // thread 2
          // wait for a to be 10
          NSLog(@"%d", b); // ?
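One way to actually obtain the missing guarantee for the two-thread example is a release store paired with an acquire load. The sketch uses C11 `<stdatomic.h>` and POSIX threads, which the talk does not mention, so treat it as an assumption about tooling. `a` and `b` mirror the slide; `thread1` and `read_b_after_a` are illustrative names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int b = 0;          /* plain data, published through a */
static atomic_int a = 0;   /* flag the other thread waits on */

/* "thread 1" from the slide. */
static void *thread1(void *unused) {
    (void)unused;
    b = 20;
    /* Release: the write to b cannot be reordered after this store. */
    atomic_store_explicit(&a, 10, memory_order_release);
    return NULL;
}

/* "thread 2" from the slide: wait for a to be 10, then read b. */
int read_b_after_a(void) {
    pthread_t writer;
    b = 0;
    atomic_store(&a, 0);
    int rc = pthread_create(&writer, NULL, thread1, NULL);
    assert(rc == 0);

    /* Acquire: once 10 is observed, the earlier write to b is visible too. */
    while (atomic_load_explicit(&a, memory_order_acquire) != 10) {
        /* spin; fine for a demo, wasteful in real code */
    }
    int seen = b;

    pthread_join(writer, NULL);
    return seen;
}
```

With plain `int a` and no synchronization, both the compiler and the CPU are allowed to make the write to `a` visible before the write to `b`.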
  48. 48. Solutions • Avoid shared mutable state • communicate by message passing • design your objects as immutable • avoid multithreading • Synchronization • You must always have “a plan” • if you can’t tell which code is supposed to run in which thread, then nobody can help you • if you can’t tell which data can be accessed from which thread, then nobody can help you
  49. 49. So what is guaranteed? • Semantics for one thread • “the (single-threaded) code you wrote will have the correct result” • For multi-threaded code, you have to obtain guarantees by using: • Atomic data types, volatile keyword • Locks, semaphores, memory barriers • For 3rd party code, generally you can’t assume anything
  50. 50. Atomic types • Which data types are atomic? • Depends on the architecture! • Pointers and “native” integers are usually atomic • What does an atomic data type guarantee? • Also depends on the architecture! • A single read or a single write is usually atomic • Definitely not “i++” • OSAtomicIncrement, …
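To illustrate why "i++" needs more than an atomic read and an atomic write, here is a sketch of an atomic increment. `OSAtomicIncrement` is Apple-specific, so the example swaps in the C11 `atomic_fetch_add_explicit` call; `hits`, `count_hits` and `run_atomic_counter` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define ITERATIONS 100000

static atomic_long hits = 0;

/* The whole read-modify-write is one atomic operation, so no lock is needed. */
static void *count_hits(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Runs THREADS counting threads and returns the total. */
long run_atomic_counter(void) {
    pthread_t threads[THREADS];
    atomic_store(&hits, 0);
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, count_hits, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&hits);
}
```

Relaxed ordering is enough here because only the count matters; the joins make the final value visible to the caller.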
  51. 51. Objective-C atomic properties • @property (atomic) int a; • Only affects auto-generated getters and setters • Again, a single read is atomic, a single write is atomic • Again, “obj.a++” is not atomic • It has no effect on direct member access, obj->a • “atomic” is the default
  52. 52. Objective-C messaging • Is the order of Obj-C method calls guaranteed? • It seems so, the current compilers don’t optimize through the dynamic dispatch (objc_msgSend) • But it’s still not guaranteed • This might (and probably will) change in the future
  53. 53. Volatile keyword • don’t confuse with Java volatile • prevents some compiler optimizations • the variable can change on its own • doesn’t give you atomicity • doesn’t give you ordering • there are better means of synchronization
  54. 54. Locks • Mutexes, critical sections • allow only a single thread at a time to be in this part of the code • -[NSLock lock] • -[NSLock unlock] • @synchronized (obj) { … } • uses an implicit lock associated with the object • handles exceptions • Recursive locks, R/W locks, conditions
  55. 55. Lock-free algorithms and data structures • Some concurrent structures (hash tables, queues) can be written without using explicit locks • Currently a major topic in CS • databases • The name is confusing though, there is still a lot of locking happening • cache coherency • memory bus locking for complex atomic operations
  56. 56. Memory barriers • Locks can be expensive • A memory barrier ensures ordering without locking • Memory reads and writes issued before the barrier complete before those issued after it • But the guarantee is only at the point of the barrier! • OSMemoryBarrier
  57. 57. Is the trouble worth it? • Measure! • OK, so you need more than a single thread • use task-level parallelization (GCD) with clear input and output, use immutable data and message passing • Measure again! • OK, so you need more than that • find the bottleneck, don’t assume • is it really the CPU? Isn’t the bottleneck in the memory/network/disk?
  58. 58. Demo 3 https://github.com/kubabrecka/mobos-ios
  59. 59. Multithreading and Parallelism on iOS Part IV: Performance tuning, ILP
  60. 60. Multithreading isn’t everything • There are plenty of ways to make your code run faster • avoiding unnecessary work • choosing better algorithms • calculations on the GPU • using vector instructions (AVX, SSE, NEON) • hand-optimizing your assembly • tweaking the compiler optimizations
  61. 61. The bottleneck • It’s easy to make wrong assumptions • Your bottleneck can be • CPU • Memory • I/O (disk, network) • GPU • There is no “usually”
  62. 62. Some common UI issues • Creating UIViews is slow • reuse views, dequeue cells in tables • Loading images is slow • cache images • Rendering is slow • avoid drawRect, consider rasterization of flattened views • Scrolling is slow • don’t do heavy work in scrollViewDidScroll • Rendering shadows is slow • use shadowPath • Rendering layer masks is slow • pre-render
  63. 63. Choose your data structures • -[NSArray containsObject:] • O(n) • -[NSSet containsObject:] • O(1) on average
  64. 64. Always profile first • Don’t guess, measure! • Amdahl’s law • Hardware is cheap, programmers are expensive
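Amdahl's law bounds the speedup on n cores when only a fraction p of the work can run in parallel: S(n) = 1 / ((1 - p) + p / n). A small helper, with the hypothetical name `amdahl_speedup`, makes the limit concrete:

```c
#include <assert.h>

/* Upper bound on the speedup when `parallel_fraction` of the work
   (0.0 to 1.0) is spread evenly over `cores` cores. */
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

For example, if half of the work is parallelizable, two cores give at most 1 / (0.5 + 0.25), about 1.33x, and no number of cores can reach 2x.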
  65. 65. Profiling with Instruments • What can you measure with Instruments? • CPU • utilization • all performance counters (interrupts, syscalls, user/kernel time, …) • Memory • free memory • allocations, leaks, “zombies” • many more performance counters (page faults, cache hits/misses, …) • Network • Battery usage • Display FPS • Single process / multiple processes • …
  66. 66. Measure carefully • Instruments isn’t perfect • Sampling is only a statistical method • Real devices behave very differently from simulators • Hardware is different • Compiled code is different (both yours and libraries) • Verify your assumptions • In many cases, wrapping your code with two calls to [NSDate date] and subtracting is the best approach
  67. 67. Optimize memory/cache accesses • Cache lines (64 B) • Try to linearize memory accesses • Choose correct data structures • array of structs vs. struct of arrays • Aligned memory accesses
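The "array of structs vs. struct of arrays" choice can be sketched as follows. The particle layout and all names are invented for illustration. When a loop touches only one field, the struct-of-arrays layout reads memory linearly, while the array-of-structs layout uses just 4 of every 16 bytes it pulls into the cache.

```c
#include <stddef.h>

#define PARTICLES 1024

/* Array of structs: the fields of one particle sit next to each other. */
typedef struct {
    float x, y, z, mass;
} particle_t;

/* Struct of arrays: all values of one field sit next to each other. */
typedef struct {
    float x[PARTICLES];
    float y[PARTICLES];
    float z[PARTICLES];
    float mass[PARTICLES];
} particles_soa_t;

/* Strided access: each mass is 16 bytes away from the previous one. */
float total_mass_aos(const particle_t *particles, size_t count) {
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += particles[i].mass;
    }
    return total;
}

/* Linear access: consecutive masses share cache lines. */
float total_mass_soa(const particles_soa_t *particles, size_t count) {
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += particles->mass[i];
    }
    return total;
}
```

If the loop needs all four fields of each particle, the array-of-structs layout can be the better one, so the right choice depends on the access pattern.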
  68. 68. Instruction-level parallelism • The compiler tries to maximize ILP with scheduling • The main obstacle is data dependency • a series of arithmetic operations which depend on each other simply cannot be parallelized • independent operations are easily parallelized • CPU is superscalar and has deep pipelines • the problem is that often the compiler can’t be sure about the dependency • memory accesses, aliasing • it has to assume the dependency is there
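The data-dependency point can be shown with two ways of summing an array. In the first, every addition waits for the previous one; in the second, four independent chains can overlap in a superscalar pipeline. The function names are illustrative. For integers an optimizing compiler may do this transformation itself, so any gain has to be measured.

```c
#include <stddef.h>

/* One accumulator: each addition depends on the result of the previous one. */
long sum_single_chain(const int *values, size_t count) {
    long sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum;
}

/* Four accumulators: the four additions per iteration are independent. */
long sum_four_chains(const int *values, size_t count) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        s0 += values[i];
        s1 += values[i + 1];
        s2 += values[i + 2];
        s3 += values[i + 3];
    }
    for (; i < count; i++) {   /* leftover elements */
        s0 += values[i];
    }
    return s0 + s1 + s2 + s3;
}
```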
  69. 69. Help the compiler • The compiler is smart: • GCC: dead code elimination, common subexpression elimination, forward propagation, loop unrolling, tail call elimination, loop invariant motion, lower complex arithmetic, vectorization, modulo scheduling, … • Sometimes, it would like to be smart, but it can’t: • the C “restrict” keyword (C99): void * memcpy(void * restrict s1, const void * restrict s2, size_t n);
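A sketch of `restrict` in user code, with the hypothetical function `scale_into`. The qualifier is a promise from the caller that `dst` and `src` never overlap. Without it the compiler has to assume a store to `dst[i]` might change a later `src[j]`, which can block vectorization.

```c
#include <stddef.h>

/* Caller guarantees that dst and src do not overlap; breaking that
   promise is undefined behavior. */
void scale_into(float *restrict dst, const float *restrict src,
                float factor, size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i] * factor;
    }
}
```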
  70. 70. Vector instructions • SIMD = Single Instruction Multiple Data • ARM NEON • 128-bit instructions (e.g. 4x 32-bit or 16x 8-bit at once) • LLVM auto-vectorizer • Often you have to change your data structure • alignment • interleaved values
  71. 71. Accelerate.framework • Heavily optimized built-in framework for: • image processing • image format conversion and encoding/decoding • DSP, FFT • various general math on “large” data
          #include <Accelerate/Accelerate.h>

          vFloat vx = { 1.f, 2.f, 3.f, 4.f };
          vFloat vy;
          ...
          vy = vsinf(vx);
  72. 72. Away from the CPU • GPGPU • Only through OpenGL ES shaders • Perfect for image processing (Core Image, GPUImage) • M7 motion coprocessor (iPhone 5S)
  73. 73. Thank you for your attention.
  74. 74. Multithreading and Parallelism on iOS • Kuba Brecka, @kubabrecka • Mobile Operating Systems Conference MobOS 2013
