Multithreading and Parallelism on iOS [MobOS 2013]
Transcript

  • 1. Multithreading and Parallelism on iOS • Kuba Brecka @kubabrecka • Mobile Operating Systems Conference MobOS 2013
  • 2. Agenda • Part I: Parallelism and multithreading overview • Part II: Thread-safety, GCD, operation queues • Part III: Synchronization, locking, memory model • Part IV: Performance tuning, ILP • Part V: (at the party) Whatever you’d like to discuss
  • 3. Multithreading and Parallelism on iOS Part I: Parallelism and multithreading overview
  • 4. Quiz 1
      int a;

      - (void)method {
          a = 0;
          dispatch_queue_t queue = dispatch_get_global_queue(
              DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
          dispatch_async(queue, ^{ a = 1; });
          dispatch_async(queue, ^{ a = 2; });
          NSLog(@"%d", a);
      }
  • 5. Quiz 2
      int a;

      - (void)method {
          a = 0;
          dispatch_queue_t queue = dispatch_get_global_queue(
              DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
          dispatch_async(queue, ^{ a = 1; });
          while (a == 0) {
              // wait
          }
          NSLog(@"%d", a);
      }
  • 6. Parallelism is a huge topic
  • 7. Terminology • Parallel • Multi-threaded • Concurrent • Simultaneous • Asynchronous
  • 8. Why parallelize? • Responsiveness • “when I scroll, it’s smooth” • Performance • “it works fast” • Energy saving • “it doesn’t drain my battery” • Convenience • some things are parallel by nature, e.g. running two completely separate apps
  • 9. How? • Multiple processes • XPC, fork • Multiple threads • POSIX Threads, NSThread • High-level thread abstraction • Operation queues, dispatch queues • GPGPU • Instruction-level parallelism • superscalar CPUs, pipelining, vector instructions • Multiple PCs • servers, clouds
  • 10. Threads • What is a thread? • It’s an abstraction made by the OS • The CPU has no such concept • Represents a line of calculation • Has an ID, a stack, thread-local storage, priority, CPU registers • Shares memory and resources within a process • The OS scheduler runs/pauses threads • context switching
  • 11. Issues with threading • Race conditions • the result depends on the timing of the scheduler • the behavior is non-deterministic • can result in almost anything • crash, wrong result, corrupted data • So, you have to use locks/mutexes/… • More issues: deadlocks, livelocks, starvation • Even the best guys have trouble with these • Security consequences, vulnerabilities
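The race condition described on this slide can be sketched in plain C with POSIX threads rather than the Cocoa APIs used elsewhere in the deck. The names `counter`, `increment_locked` and `run_locked_counter` are hypothetical and chosen only for this illustration. Two threads increment a shared counter; the mutex makes each read-modify-write step exclusive, so the final value is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000

/* Shared mutable state (hypothetical example data). */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: increments the shared counter ITERATIONS times under the lock. */
static void *increment_locked(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++; /* read-modify-write: racy without the lock */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two incrementing threads and returns the final counter value. */
long run_locked_counter(void) {
    pthread_t t1, t2;
    counter = 0;
    int rc1 = pthread_create(&t1, NULL, increment_locked, NULL);
    int rc2 = pthread_create(&t2, NULL, increment_locked, NULL);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1;
    (void)rc2;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

If the lock and unlock calls are removed, the two threads can interleave inside `counter++` and the result may come out lower than expected, and differently on each run.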
  • 12. Know your enemy • The compiler • The CPU • The memory • Time • Your brain
  • 13. The iPhone has matured
      iPhone 4:  512 MB RAM, A4 SoC (1 core), 800 MHz
      iPhone 4S: 512 MB RAM, A5 SoC (2 core), 800 MHz
      iPhone 5:  1 GB RAM, A6 SoC (2 core), 1300 MHz
      iPhone 5S: 1 GB RAM, A7 SoC (2 core), 1300 MHz
  • 14. ARM has matured • Apple A5 (2011) • ARM Cortex-A9 MPCore • 2 cores • out-of-order execution • speculative execution • superscalar, pipelining (8 stages) • NEON 128-bit SIMD • Apple A7 (2013) • ARMv8-A “Cyclone” • 64-bit, 32 registers, per-core L1 cache
  • 15. iOS has matured • The kernel knows a lot more about the system than the developer • GCD • Operation Queues • LLVM, compiler optimizations • GPU computations • Accelerate.framework
  • 16. iOS threading technologies • Multiple processes – forking disabled, no XPC • Low-level threads • POSIX Threads (pthread) • NSThread • -[NSObject performSelectorInBackground:withObject:] • Higher-level abstractions • NSOperationQueue, NSOperation • GCD
  • 17. Is multithreading hard? • Yes, if you don’t know what you’re doing. • But that’s true for anything. • Paul E. McKenney: Is Parallel Programming Hard, And, If So, What Can You Do About It? (2013) • https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
  • 18. You need to know how it works • The abstractions you use (threads, dispatch queues) are leaky • You still must know how it works below: • CPU • OS • compiler (LLVM) • libraries and 3rd party code you are using • language specification • language implementation • + the abstraction you are using (GCD)
  • 19. You need to know even more • Often you parallelize to get better performance • For this you need to know • CPU architecture details • CPU instruction latencies • memory hierarchy and latencies
  • 20. Parallelizing tasks vs. algorithms • Task = a standalone unit of work • has some inputs, gives some outputs • “add a blur effect to these 1000 photos” • 1 photo = 1 task (independent) • “add a blur effect to this one 5000x5000px photo” • 1 task = ? • Some algorithms simply cannot be parallelized (you will not get any significant speedup)
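The "1 photo = 1 task" idea can be illustrated with a small C sketch using POSIX threads. This is an assumption-laden stand-in: summing integers replaces the blur effect, and `sum_task_t`, `run_sum_task`, `parallel_sum` and `TASK_COUNT` are hypothetical names. Each task gets its own read-only input slice and its own output slot, so no locking is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TASK_COUNT 4

/* One independent unit of work: clear input, clear output. */
typedef struct {
    const int *input;  /* read-only slice used only by this task */
    size_t     length;
    long       output; /* written only by this task */
} sum_task_t;

static void *run_sum_task(void *arg) {
    sum_task_t *task = arg;
    long total = 0;
    for (size_t i = 0; i < task->length; i++) {
        total += task->input[i];
    }
    task->output = total;
    return NULL;
}

/* Splits the array into TASK_COUNT slices and sums them on separate threads. */
long parallel_sum(const int *values, size_t n) {
    pthread_t threads[TASK_COUNT];
    sum_task_t tasks[TASK_COUNT];
    size_t chunk = n / TASK_COUNT;

    for (int t = 0; t < TASK_COUNT; t++) {
        size_t begin = (size_t)t * chunk;
        /* The last task also takes the remainder. */
        size_t end = (t == TASK_COUNT - 1) ? n : begin + chunk;
        tasks[t] = (sum_task_t){ values + begin, end - begin, 0 };
        int rc = pthread_create(&threads[t], NULL, run_sum_task, &tasks[t]);
        assert(rc == 0);
        (void)rc;
    }

    long total = 0;
    for (int t = 0; t < TASK_COUNT; t++) {
        pthread_join(threads[t], NULL); /* results are safe to read after join */
        total += tasks[t].output;
    }
    return total;
}
```

For an input this small the thread-creation overhead outweighs any gain; the sketch only shows the shape of the decomposition.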
  • 21. Multithreading and Parallelism on iOS Part II: Thread-safety, GCD, operation queues
  • 22. What’s thread safety? • “Thread-safe object” • you can safely use the object from multiple threads at the same time • the internal state of the object will not get corrupted and it will behave correctly • When you don’t know if an object is thread-safe, you have to assume it isn’t • How do you make your object thread-safe? • immutability, locks, atomic reads/writes
  • 23. Shared mutable state • Exclusive immutable object = no problem • Shared immutable object = no problem • Exclusive mutable object = no problem • Shared mutable object • root of all evil • you always want to minimize this
  • 24. Global variables • “Global variables are bad” • Multi-threading is another very good reason not to use global variables / global state • Global variables are always shared • Watch out for “hidden” global state: • working directory, chdir() • environment variables, putenv()
  • 25. Thread-safety vs. iOS • Terrible lack of proper documentation • Most of the low-level Obj-C runtime is thread-safe • memory management, ARC, weak references, … • Immutable objects (NSString, NSArray, …) are thread-safe • A few other classes are thread-safe • Usually it’s thread-safe to call class methods • google for “iOS thread safety” • https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html
  • 26. POSIX threads • “plain threads” • C API • if you want to pass an object to the new thread, you will have issues with memory management • Synchronization • mutexes, conditions, R/W locks, barriers
  • 27. POSIX thread API • pthread_create • pthread_join • mutex • pthread_mutex_init, pthread_mutex_lock, pthread_mutex_unlock • conditions • pthread_cond_init, pthread_cond_signal, pthread_cond_wait
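The calls listed on this slide fit together roughly as in the following C sketch. The `mailbox_t` type and the `producer` and `receive_from_thread` functions are hypothetical and exist only for this example. One thread publishes a value under a mutex and signals a condition; the caller waits on that condition, then joins the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A one-slot mailbox protected by a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cond;
    bool            ready;
    int             value;
} mailbox_t;

static void *producer(void *arg) {
    mailbox_t *box = arg;
    pthread_mutex_lock(&box->lock);
    box->value = 42;
    box->ready = true;
    pthread_cond_signal(&box->ready_cond); /* wake the waiting thread */
    pthread_mutex_unlock(&box->lock);
    return NULL;
}

/* Starts a producer thread, waits for its value and returns it. */
int receive_from_thread(void) {
    mailbox_t box;
    pthread_mutex_init(&box.lock, NULL);
    pthread_cond_init(&box.ready_cond, NULL);
    box.ready = false;
    box.value = 0;

    pthread_t thread;
    int rc = pthread_create(&thread, NULL, producer, &box);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&box.lock);
    while (!box.ready) { /* the loop guards against spurious wakeups */
        pthread_cond_wait(&box.ready_cond, &box.lock);
    }
    int result = box.value;
    pthread_mutex_unlock(&box.lock);

    pthread_join(thread, NULL);
    pthread_cond_destroy(&box.ready_cond);
    pthread_mutex_destroy(&box.lock);
    return result;
}
```

`pthread_cond_wait` releases the mutex while sleeping and reacquires it before returning, which is why the `ready` flag is always checked while the lock is held.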
  • 28. NSThread • “plain threads” as well • Obj-C API • mostly just a wrapper around POSIX threads • memory management just works • Synchronize with NSLock, NSCondition, …
  • 29. NSThread API • -[NSThread initWithTarget:selector:object:] • -[NSThread start] • +[NSThread detachNewThreadSelector:toTarget:withObject:] • subclassing NSThread • -[NSObject performSelectorInBackground:withObject:]
  • 30. Thread-specific properties • Thread-local storage • Thread priorities • Autorelease pools • Detached vs. joinable
  • 31. Grand Central Dispatch • Let’s not think about threads • Instead, let’s think about tasks • New concepts: • Tasks • Queues • Queue-specific data • Dispatch groups • Dispatch sources • Synchronization • Semaphores, barriers • C API (!) but has ARC and works with blocks
  • 32. GCD queues • Main queue • there is just one, executed on the main thread • Concurrent queue • tasks run concurrently • 4 pre-made concurrent queues with different priorities • DISPATCH_QUEUE_PRIORITY_DEFAULT, _HIGH, _LOW, _BACKGROUND • you can make your own • Serial queue • only one task at a time, in order • you can make your own
  • 33. GCD task API • Get/create a queue: • dispatch_get_global_queue • dispatch_get_main_queue • dispatch_queue_create • Submit task: • dispatch_sync • dispatch_async • dispatch_apply
  • 34. GCD convenience API • dispatch_once • guarantees the code runs only once • use to implement a proper and fast singleton • dispatch_after • execute the task at a specific time
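The run-once guarantee can be sketched in plain C with `pthread_once`, the POSIX analogue of `dispatch_once`. This is a swapped-in technique, not the GCD API itself, and `init_config`, `get_config` and `get_init_calls` are hypothetical names. However many times and from however many threads `get_config` is called, the initializer runs exactly once.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t config_once = PTHREAD_ONCE_INIT;
static int init_calls = 0;    /* counts how often the initializer ran */
static int shared_config = 0; /* the lazily created "singleton" value */

/* Runs at most once, no matter how many callers race to get_config(). */
static void init_config(void) {
    init_calls++;
    shared_config = 1234;
}

/* Returns the shared value, initializing it on first use. */
int get_config(void) {
    int rc = pthread_once(&config_once, init_config);
    assert(rc == 0);
    (void)rc;
    return shared_config;
}

/* Exposed only so the example can show the initializer ran once. */
int get_init_calls(void) {
    return init_calls;
}
```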
  • 35. It’s not threads • GCD uses threads, but the threads are completely managed by GCD • You can’t assume your code will run on any specific thread • even two tasks from the same serial queue can run on different threads • Don’t use thread-local storage • Don’t use thread priorities
  • 36. Operation queues • A similar abstraction to GCD, this time you have: • NSOperation • either a block, a method call or custom subclass • concurrent or non-concurrent • dependencies on other NSOperations • support for cancellation • NSOperationQueue • executes the operations, or you can execute an operation directly
  • 37. Operation queues API • -[NSOperationQueue addOperation:] • -[NSOperationQueue addOperationWithBlock:] • -[NSOperation addDependency:] • +[NSBlockOperation blockOperationWithBlock:] • -[NSInvocationOperation initWithTarget:selector:object:]
  • 38. Comparison • POSIX threads, NSThread • thread-based • you have control over the lifetime of threads • overhead when creating • memory-management issues • GCD, operation queues • task-based • nice API with objects/blocks • operation queues • dependencies
  • 39. Run loops and messaging • Avoid shared mutable state • For POSIX threads and NSThreads: • put your thread into an event loop, where it just waits until an event occurs • the main thread has this by default • hidden inside UIApplicationMain • then you can communicate with the thread through: • -[NSObject performSelector:onThread:withObject:waitUntilDone:]
  • 40. Run loop API • +[NSRunLoop currentRunLoop] • -[NSRunLoop run] • you have to add at least one input source or it will return immediately • but you can add an empty port • [NSMachPort port] • -[NSRunLoop addPort:forMode:]
  • 41. Main thread • first thread = main thread = UI thread • all rendering • all layout • scrolling, panning, zooming • user input (touches, on-screen keyboard, external keyboard) • system events • Yes, that’s a lot of work. • 60 FPS = 16 ms per frame • Yes, that’s very little time.
  • 42. Offload the main thread • Goal: Keep the UI thread responsive • Rule: • Do as much work as possible on other threads • Well, but… • Do as little work as possible in the background, that is just enough to keep the main thread responsive • Measure, measure, measure
  • 43. Rendering and animations • Your app doesn’t have access to the GPU/display • Background process called “backboardd” • IPC – rendering commands • Shared memory – backing stores • CAAnimations are transferred to backboardd and performed without any communication with your app
  • 44. Demo 1 https://github.com/kubabrecka/mobos-ios
  • 45. Multithreading and Parallelism on iOS Part III: Synchronization, locking, memory model
  • 46. Demo 2 https://github.com/kubabrecka/mobos-ios
  • 47. Only trust what’s guaranteed • The order of things isn’t guaranteed unless someone tells you:
      int a, b; // global variables

      // thread 1
      b = 20;
      a = 10;

      // thread 2
      // wait for a to be 10
      NSLog(@"%d", b); // ?
  • 48. Solutions • Avoid shared mutable state • communicate by message passing • design your objects as immutable • avoid multithreading • Synchronization • You must always have “a plan” • if you can’t tell which code is supposed to run in which thread, then nobody can help you • if you can’t tell which data can be accessed from which thread, then nobody can help you
  • 49. So what is guaranteed? • Semantics for one thread • “the (single-threaded) code you wrote will have the correct result” • For multi-threaded code, you have to obtain guarantees by using: • Atomic data types, volatile keyword • Locks, semaphores, memory barriers • For 3rd party code, generally you can’t assume anything
  • 50. Atomic types • Which data types are atomic? • Depends on the architecture! • Pointers and “native” integers are usually atomic • What does an atomic data type guarantee? • Also depends on the architecture! • A single read or a single write is usually atomic • Definitely not “i++” • OSAtomicIncrement, …
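The point that "i++" is not atomic, while a dedicated atomic increment is, can be sketched with C11 `<stdatomic.h>`. That header is used here in place of the OSAtomicIncrement family named on the slide, and `atomic_counter`, `increment_atomic` and `run_atomic_counter` are hypothetical names. Two threads increment the same counter without any lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS 100000

static atomic_long atomic_counter;

/* Thread body: each atomic_fetch_add is one indivisible read-modify-write. */
static void *increment_atomic(void *unused) {
    (void)unused;
    for (int i = 0; i < INCREMENTS; i++) {
        atomic_fetch_add(&atomic_counter, 1);
    }
    return NULL;
}

/* Runs two incrementing threads and returns the final counter value. */
long run_atomic_counter(void) {
    pthread_t t1, t2;
    atomic_store(&atomic_counter, 0);
    int rc1 = pthread_create(&t1, NULL, increment_atomic, NULL);
    int rc2 = pthread_create(&t2, NULL, increment_atomic, NULL);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1;
    (void)rc2;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&atomic_counter);
}
```

With a plain `long` and `counter++` instead, the increments from the two threads could overlap and some of them would be lost.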
  • 51. Objective-C atomic properties • @property (atomic) int a; • Only affects auto-generated getters and setters • Again, a single read is atomic, a single write is atomic • Again, “obj.a++” is not atomic • It has no effect on direct member access, obj->a • “atomic” is default
  • 52. Objective-C messaging • Is the order of Obj-C method calls guaranteed? • It seems so, the current compilers don’t optimize through the dynamic dispatch (objc_msgSend) • But it’s still not guaranteed • This might (and probably will) change in the future
  • 53. Volatile keyword • don’t confuse with Java volatile • prevents some compiler optimizations • the variable can change on its own • doesn’t give you atomicity • doesn’t give you ordering • there are better means of synchronization
  • 54. Locks • Mutexes, critical sections • allow only a single thread to be in this part of code at the same time • -[NSLock lock] • -[NSLock unlock] • @synchronized (object) { … } • uses an implicit lock, which exists on each object • handles exceptions • Recursive locks, R/W locks, conditions
  • 55. Lock-free algorithms and data structures • Some concurrent structures (hash tables, queues) can be written without using explicit locks • Currently a major topic in CS • databases • The name is confusing though, there is still a lot of locking happening • cache coherency • memory bus locking for complex atomic operations
  • 56. Memory barriers • Locks can be expensive • A memory barrier ensures ordering without locking • Memory reads and writes issued before the barrier complete before those issued after it • But the guarantee is only at the point of the barrier! • OSMemoryBarrier
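The two-variable example from slide 47 can be made well-defined with ordering guarantees. The sketch below uses C11 release/acquire atomics in place of OSMemoryBarrier, so it is an analogue rather than the API named on the slide, and `payload`, `flag`, `writer` and `read_after_flag` are hypothetical names. The release store guarantees that the earlier write to `payload` is visible to any thread whose acquire load observes the flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;     /* plain data, plays the role of b */
static atomic_int flag; /* synchronization variable, plays the role of a */

static void *writer(void *unused) {
    (void)unused;
    payload = 20;
    /* Release: the write to payload cannot be reordered after this store. */
    atomic_store_explicit(&flag, 10, memory_order_release);
    return NULL;
}

/* Waits until the flag is set, then returns the payload it observes. */
int read_after_flag(void) {
    pthread_t thread;
    payload = 0;
    atomic_store(&flag, 0);
    int rc = pthread_create(&thread, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;

    /* Acquire: reads after this load cannot be reordered before it. */
    while (atomic_load_explicit(&flag, memory_order_acquire) != 10) {
        /* spin until the writer publishes */
    }
    int seen = payload;
    pthread_join(thread, NULL);
    return seen;
}
```

With plain `int` variables and no ordering, both the compiler and the CPU would be free to make the write to the flag visible before the write to the payload.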
  • 57. Is the trouble worth it? • Measure! • OK, so you need more than a single thread • use task-level parallelization (GCD) with clear input and output, use immutable data and message passing • Measure again! • OK, so you need more than that • find the bottleneck, don’t assume • is it really the CPU? Isn’t the bottleneck in the memory/network/disk?
  • 58. Demo 3 https://github.com/kubabrecka/mobos-ios
  • 59. Multithreading and Parallelism on iOS Part IV: Performance tuning, ILP
  • 60. Multithreading isn’t everything • There are plenty of ways to make your code run faster • avoiding unnecessary work • choosing better algorithms • calculations on the GPU • using vector instructions (AVX, SSE, NEON) • hand-optimizing your assembly • tweaking the compiler optimizations
  • 61. The bottleneck • It’s easy to make wrong assumptions • Your bottleneck can be • CPU • Memory • I/O (disk, network) • GPU • There is no “usually”
  • 62. Some common UI issues • Creating UIViews is slow • reuse views, dequeue cells in tables • Loading images is slow • cache images • Rendering is slow • avoid drawRect, consider rasterization of flattened views • Scrolling is slow • don’t do heavy work in scrollViewDidScroll • Rendering shadows is slow • use shadowPath • Rendering layer masks is slow • pre-render
  • 63. Choose your data structures • -[NSArray containsObject:] • O(n) • -[NSSet containsObject:] • O(1) on average
  • 64. Always profile first • Don’t guess, measure! • Amdahl’s law • Hardware is cheap, programmers are expensive
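Amdahl's law, named on this slide, bounds the speedup available from parallelizing only part of a program. If a fraction p of the run time can be spread over n workers, the speedup is 1 / ((1 - p) + p / n). A small C sketch follows; the function name `amdahl_speedup` is hypothetical.

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law.
 * parallel_fraction: share of the run time that can be parallelized (0..1)
 * workers:           number of cores or threads working on that share */
double amdahl_speedup(double parallel_fraction, int workers) {
    assert(workers > 0);
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / workers);
}
```

For example, if half of the run time is parallelizable, two cores give a speedup of about 1.33, and no number of cores can give more than 2.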
  • 65. Profiling with Instruments • What can you measure with Instruments? • CPU • utilization • all performance counters (interrupts, syscalls, user/kernel time, …) • Memory • free memory • allocations, leaks, “zombies” • many more performance counters (page faults, cache hits/misses, …) • Network • Battery usage • Display FPS • Single process / multiple processes •…
  • 66. Measure carefully • Instruments isn’t perfect • Sampling is only a statistical method • Real devices behave very differently from simulators • Hardware is different • Compiled code is different (both yours and libraries) • Verify your assumptions • In many cases, wrapping your code with two calls to [NSDate date] and subtracting is the best approach
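The "two timestamps and a subtraction" approach looks roughly like this in plain C. POSIX `clock_gettime` with a monotonic clock stands in for the two `[NSDate date]` calls, and `now_seconds`, `time_call` and `busy_work` are hypothetical names, the last being a made-up workload.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Current time of a monotonic clock, in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs work() once and returns the elapsed wall-clock time in seconds. */
double time_call(void (*work)(void)) {
    double start = now_seconds();
    work();
    return now_seconds() - start;
}

/* Hypothetical workload; the volatile sink keeps the loop from being removed. */
static volatile unsigned long sink;

void busy_work(void) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) {
        acc += i;
    }
    sink = acc;
}
```

A monotonic clock is used because a wall-clock time can jump when the system time is adjusted, which would distort short measurements. Repeating the measurement several times helps separate the signal from scheduler noise.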
  • 67. Optimize memory/cache accesses • Cache lines (64 B) • Try to linearize memory accesses • Choose correct data structures • array of structs vs. struct of arrays • Aligned memory accesses
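The "array of structs vs. struct of arrays" choice can be sketched in C as follows. The point types and the fixed `COUNT` are hypothetical. Both functions compute the same sum of x coordinates, but the struct-of-arrays version reads consecutive floats, so every byte of each fetched cache line is useful.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 4

/* Array of structs: each x is followed in memory by an unrelated y and z. */
typedef struct {
    float x, y, z;
} point_aos_t;

/* Struct of arrays: all x values are contiguous in memory. */
typedef struct {
    float x[COUNT];
    float y[COUNT];
    float z[COUNT];
} points_soa_t;

/* Strided access: reads one float out of every three. */
float sum_x_aos(const point_aos_t *points, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += points[i].x;
    }
    return total;
}

/* Linear access: reads n consecutive floats. */
float sum_x_soa(const points_soa_t *points, size_t n) {
    assert(n <= COUNT);
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += points->x[i];
    }
    return total;
}
```

Which layout wins depends on the access pattern: code that always uses x, y and z together is usually better served by the array of structs.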
  • 68. Instruction-level parallelism • The compiler tries to maximize ILP with scheduling • The main obstacle is data dependency • a series of arithmetic operations which depend on each other simply cannot be parallelized • independent operations are easily parallelized • CPU is superscalar and has deep pipelines • the problem is that often the compiler can’t be sure about the dependency • memory accesses, aliasing • it has to assume the dependency is there
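The data-dependency obstacle can be illustrated with a hedged C sketch; `sum_one_chain` and `sum_four_chains` are hypothetical names. In the first function every addition depends on the previous one. The second keeps four independent accumulators, which a superscalar CPU can advance in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each add must wait for the previous result. */
float sum_one_chain(const float *values, size_t n) {
    assert(values != NULL || n == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}

/* Four independent chains: the adds within one iteration do not depend
 * on each other, so they can overlap in the pipeline. */
float sum_four_chains(const float *values, size_t n) {
    assert(values != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += values[i];
        s1 += values[i + 1];
        s2 += values[i + 2];
        s3 += values[i + 3];
    }
    for (; i < n; i++) { /* leftover elements */
        s0 += values[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The compiler generally will not make this transformation for floating-point code on its own, because reordering float additions can change the rounded result. For the same reason the two functions may differ in the last bits on arbitrary inputs.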
  • 69. Help the compiler • The compiler is smart: • GCC: dead code elimination, common subexpression elimination, forward propagation, loop unrolling, tail call elimination, loop invariant motion, lower complex arithmetic, vectorization, modulo scheduling, … • Sometimes, it would like to be smart, but it can’t: • the C “restrict” keyword (C99): void * memcpy(void * restrict s1, const void * restrict s2, size_t n);
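As a small illustration of `restrict` beyond the `memcpy` prototype, consider the following C99 sketch; the function `scale_add` is hypothetical. The qualifier is a promise from the caller that `dst` and `src` never overlap, which lets the compiler keep values in registers and vectorize the loop without assuming that a store to `dst` might change `src`.

```c
#include <assert.h>
#include <stddef.h>

/* Computes dst[i] += k * src[i] for i in [0, n).
 * restrict: the caller promises that dst and src do not alias. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    assert((dst != NULL && src != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        dst[i] += k * src[i];
    }
}
```

The promise is not checked: calling the function with overlapping buffers is undefined behavior, so `restrict` belongs only where the non-aliasing contract is clear.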
  • 70. Vector instructions • SIMD = Single Instruction Multiple Data • ARM NEON • 128-bit instructions (e.g. 4x 32-bit or 16x 8-bit at once) • LLVM auto-vectorizer • Often you have to change your data structure • alignment • interleaved values
  • 71. Accelerate.framework • Heavily optimized built-in framework for: • image processing • image format conversion and encoding/decoding • DSP, FFT • various general math on “large” data
      #include <Accelerate/Accelerate.h>

      vFloat vx = { 1.f, 2.f, 3.f, 4.f };
      vFloat vy;
      ...
      vy = vsinf(vx);
  • 72. Away from the CPU • GPGPU • Only through OpenGL ES shaders • Perfect for image processing (Core Image, GPUImage) • M7 motion coprocessor (iPhone 5S)
  • 73. Thank you for your attention.