Multithreading and Parallelism on iOS [MobOS 2013]

Multithreading and Parallelism on iOS

Kuba Brecka
@kubabrecka

Mobile Operating Systems Conference MobOS 2013

Agenda

• Part I: Parallelism and multithreading overview
• Part II: Thread-safety, GCD, operation queues
• Part III: Synchronization, locking, memory model
• Part IV: Performance tuning, ILP
• Part V: (at the party) Whatever you’d like to discuss

Part I: Parallelism and multithreading overview

Quiz 1
int a;

- (void)method
{
    a = 0;
    dispatch_queue_t queue = dispatch_get_global_queue(
        DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_async(queue, ^{
        a = 1;
    });
    dispatch_async(queue, ^{
        a = 2;
    });
    NSLog(@"%d", a);
}

Quiz 2
int a;

- (void)method
{
    a = 0;
    dispatch_queue_t queue = dispatch_get_global_queue(
        DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_async(queue, ^{
        a = 1;
    });
    while (a == 0) {
        // wait
    }
    NSLog(@"%d", a);
}

Terminology

• Parallel
• Multi-threaded
• Concurrent
• Simultaneous
• Asynchronous

Why parallelize?

• Responsiveness
• “when I scroll, it’s smooth”

• Performance

• “it works fast”

• Energy saving
• “it doesn’t drain my battery”

• Convenience

• some things are parallel by nature, e.g. running two
completely separate apps

How?
• Multiple processes
• XPC, fork

• Multiple threads
• POSIX Threads, NSThread

• High-level thread abstraction
• Operation queues, dispatch queues

• GPGPU
• Instruction-level parallelism
• superscalar CPUs, pipelining, vector instructions

• Multiple PCs
• servers, clouds

Threads

• What is a thread?
• It’s an abstraction made by the OS
• The CPU has no such concept

• Represents a line of calculation
• Has an ID, a stack, thread-local storage, priority, CPU
registers

• Shares memory and resources within a process

• The OS scheduler runs/pauses threads
• context switching

Issues with threading

• Race conditions
• the result depends on the timing of the scheduler
• the behavior is non-deterministic
• can result in almost anything
• crash, wrong result, corrupted data

• So, you have to use locks/mutexes/…
• More issues: deadlocks, livelocks, starvation

• Even the best guys have trouble with these
• Security consequences, vulnerabilities
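
A minimal sketch of the lock-based fix, written in plain C with POSIX threads rather than any iOS-specific API. The names counter, locked_worker and run_locked_counter are made up for illustration. If you removed the lock/unlock pair, the two threads would race on counter++ and the final value would depend on scheduler timing.

```c
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter ITERATIONS times under the lock. */
static void *locked_worker(void *unused) {
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;   /* read-modify-write, safe only while holding the lock */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two workers and returns the final counter value. */
long run_locked_counter(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, locked_worker, NULL);
    pthread_create(&t2, NULL, locked_worker, NULL);
    pthread_join(t1, NULL);   /* join makes the workers' writes visible here */
    pthread_join(t2, NULL);
    return counter;
}
```

With the lock in place, run_locked_counter() returns exactly 2 * ITERATIONS on every run.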

Know your enemy

• The compiler
• The CPU
• The memory
• Time
• Your brain

The iPhone has matured

Model      RAM     SoC           Clock
iPhone 4   512 MB  A4 (1 core)   800 MHz
iPhone 4S  512 MB  A5 (2 cores)  800 MHz
iPhone 5   1 GB    A6 (2 cores)  1300 MHz
iPhone 5S  1 GB    A7 (2 cores)  1300 MHz

ARM has matured

• Apple A5 (2011)
• ARM Cortex-A9 MPCore
• 2 cores
• out-of-order execution
• speculative execution
• superscalar, pipelining (8 stages)
• NEON 128-bit SIMD

• Apple A7 (2013)

• ARMv8-A “Cyclone”
• 64-bit, 32 registers, per-core L1 cache

iOS has matured

• The kernel knows a lot more about the system than
the developer

• GCD
• Operation Queues
• LLVM, compiler optimizations
• GPU computations
• Accelerate.framework

iOS threading technologies

• Multiple processes – forking disabled, no XPC
• Low-level threads
• POSIX Threads (pthread)
• NSThread
• -[NSObject performSelectorInBackground:withObject:]

• Higher-level abstractions

• NSOperationQueue, NSOperation
• GCD

Is multithreading hard?

• Yes, if you don’t know what you’re doing.
• But that’s true for anything.

• Paul E. McKenney: Is Parallel Programming Hard,
And, If So, What Can You Do About It? (2013)

• https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

You need to know how it works

• The abstractions you use (threads, dispatch queues)
are leaky

• You still must know how it works below:
• CPU
• OS
• compiler (LLVM)
• libraries and 3rd party code you are using
• language specification
• language implementation
• + the abstraction you are using (GCD)

You need to know even more

• Often you parallelize to get better performance
• For this you need to know
• CPU architecture details
• CPU instruction latencies
• memory hierarchy and latencies

Parallelizing tasks vs. algorithms

• Task = a standalone unit of work
• has some inputs, gives some outputs

• “add a blur effect to these 1000 photos”
• 1 photo = 1 task (independent)

• “add a blur effect to this one 5000x5000px photo”
• 1 task = ?

• Some algorithms simply cannot be parallelized (you
will not get any significant speedup)
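
One possible answer to "1 task = ?" is to cut the single image into independent bands and give each band to its own thread. The sketch below is plain C with POSIX threads and uses a per-pixel invert instead of a blur, which is an assumption made to keep it short: a real blur reads neighbouring pixels, so each band would have to read from a separate source buffer. NUM_BANDS, band_t and invert_parallel are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANDS 4

typedef struct {
    uint8_t *pixels;   /* start of this band */
    size_t count;      /* number of pixels in this band */
} band_t;

/* Inverts every pixel in one band; bands never overlap, so no locking is needed. */
static void *invert_band(void *arg) {
    band_t *band = arg;
    for (size_t i = 0; i < band->count; i++)
        band->pixels[i] = (uint8_t)(255 - band->pixels[i]);
    return NULL;
}

/* Splits the buffer into NUM_BANDS contiguous bands and processes them in parallel. */
void invert_parallel(uint8_t *pixels, size_t count) {
    pthread_t threads[NUM_BANDS];
    band_t bands[NUM_BANDS];
    size_t chunk = count / NUM_BANDS;
    for (int i = 0; i < NUM_BANDS; i++) {
        bands[i].pixels = pixels + (size_t)i * chunk;
        /* the last band also takes the remainder */
        bands[i].count = (i == NUM_BANDS - 1) ? count - (size_t)i * chunk : chunk;
        pthread_create(&threads[i], NULL, invert_band, &bands[i]);
    }
    for (int i = 0; i < NUM_BANDS; i++)
        pthread_join(threads[i], NULL);
}
```

Each band is a task with clear inputs and outputs, which is what makes the split safe.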

Part II: Thread-safety, GCD, operation queues

What’s thread safety?

• “Thread-safe object”
• you can safely use the object from multiple threads at
the same time

• the internal state of the object will not get corrupted and
it will behave correctly

• When you don’t know if an object is thread-safe,
you have to assume it isn’t

• How do you make your object thread-safe?
• immutability, locks, atomic reads/writes

Shared mutable state

• Exclusive immutable object = no problem
• Shared immutable object = no problem
• Exclusive mutable object = no problem
• Shared mutable object
• root of all evil
• you always want to minimize this

Global variables

• “Global variables are bad”
• Multi-threading is another very good reason not to
use global variables / global state

• Global variables are always shared
• Watch out for “hidden” global state:
• working directory, chdir()
• environment variables, putenv()

Thread-safety vs. iOS

• Terrible lack of proper documentation
• Most of the low-level Obj-C runtime is thread-safe
• memory management, ARC, weak references, …

• Immutable objects (NSString, NSArray, …) are thread-safe

• A few other classes are thread-safe
• Usually it’s thread-safe to call class methods
• google for “iOS thread safety”
• https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html

POSIX threads

• “plain threads”
• C API
• if you want to pass an object to the new thread, you will
have issues with memory management

• Synchronization
• mutexes, conditions, R/W locks, barriers

POSIX thread API

• pthread_create
• pthread_join
• mutex
• pthread_mutex_init, pthread_mutex_lock,
pthread_mutex_unlock

• conditions
• pthread_cond_init, pthread_cond_signal,
pthread_cond_wait
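
A small sketch of how these calls fit together: one thread publishes a result and signals a condition, the other waits for it without busy-waiting. It uses the static initializers PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER in place of pthread_mutex_init and pthread_cond_init. The names producer, wait_for_result and the value 42 are placeholders.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static bool ready;
static int result;

/* Worker: publishes a value under the mutex, then signals the waiter. */
static void *producer(void *unused) {
    (void)unused;
    pthread_mutex_lock(&result_lock);
    result = 42;
    ready = true;
    pthread_cond_signal(&ready_cond);
    pthread_mutex_unlock(&result_lock);
    return NULL;
}

/* Starts the worker and blocks until it has published its result. */
int wait_for_result(void) {
    pthread_t thread;
    int value;
    ready = false;
    pthread_create(&thread, NULL, producer, NULL);

    pthread_mutex_lock(&result_lock);
    while (!ready)   /* the loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cond, &result_lock);
    value = result;
    pthread_mutex_unlock(&result_lock);

    pthread_join(thread, NULL);
    return value;
}
```

pthread_cond_wait releases the mutex while sleeping and re-acquires it before returning, so the flag is always checked under the lock.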

NSThread

• “plain threads” as well
• Obj-C API
• mostly just a wrapper around POSIX threads
• memory management just works

• Synchronize with NSLock, NSCondition, …

NSThread API

• -[NSThread initWithTarget:selector:object:]
• -[NSThread start]
• +[NSThread detachNewThreadSelector:toTarget:withObject:]

• subclassing NSThread
• -[NSObject performSelectorInBackground:withObject:]

Thread-specific properties

• Thread-local storage
• Thread priorities
• Autorelease pools
• Detached vs. joinable

Grand Central Dispatch
• Let’s not think about threads
• Instead, let’s think about tasks
• New concepts:
• Tasks
• Queues
• Queue-specific data
• Dispatch groups
• Dispatch sources

• Synchronization
• Semaphores, barriers

• C API (!) but has ARC and works with blocks

GCD queues

• Main queue
• there is just one, executed on the main thread

• Concurrent queue
• tasks run concurrently
• 4 pre-made concurrent queues with different priorities
• DISPATCH_QUEUE_PRIORITY_DEFAULT, _HIGH, _LOW,
_BACKGROUND

• you can make your own

• Serial queue
• only one task at a time, in order
• you can make your own

GCD task API

• Get/create a queue:
• dispatch_get_global_queue
• dispatch_get_main_queue
• dispatch_queue_create
• Submit task:
• dispatch_sync
• dispatch_async
• dispatch_apply

GCD convenience API

• dispatch_once
• guarantees the code runs only once
• use to implement a proper and fast singleton

• dispatch_after

• execute the task at a specific time
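
For comparison, POSIX has a similar run-once primitive, pthread_once. The sketch below is not GCD; it only illustrates the singleton pattern that dispatch_once is used for. config_t, get_config and count_init_calls are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct { int value; } config_t;   /* hypothetical singleton type */

static config_t shared_config;
static pthread_once_t config_once = PTHREAD_ONCE_INIT;
static int init_calls;                     /* counts how often init ran */

static void init_config(void) {
    shared_config.value = 7;
    init_calls++;
}

/* Returns the shared instance; init_config runs exactly once, even under contention. */
config_t *get_config(void) {
    pthread_once(&config_once, init_config);
    return &shared_config;
}

static void *caller(void *unused) {
    (void)unused;
    return get_config();
}

/* Calls get_config from several threads and reports how many times init ran. */
int count_init_calls(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, caller, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return init_calls;
}
```

Every caller gets the same pointer, and the initializer has finished before any call to get_config returns.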

It’s not threads

• GCD uses threads, but the threads are completely
managed by GCD

• You can’t assume your code will run on any specific
thread

• even two tasks from the same serial queue can run on
different threads

• Don’t use thread-local storage
• Don’t use thread priorities

Operation queues

• A similar abstraction to GCD, this time you have:
• NSOperation
• either a block, a method call or custom subclass
• concurrent or non-concurrent
• dependencies on other NSOperations
• support for cancellation

• NSOperationQueue
• executes the operations, or you can execute an operation
directly

Operation queues API

• -[NSOperationQueue addOperation:]
• -[NSOperationQueue addOperationWithBlock:]
• -[NSOperation addDependency:]
• +[NSBlockOperation blockOperationWithBlock:]
• -[NSInvocationOperation
initWithTarget:selector:object:]

Comparison

• POSIX threads, NSThread
• thread-based
• you have control over the lifetime of threads
• overhead when creating
• memory-management issues

• GCD, operation queues

• task-based
• nice API with objects/blocks

• operation queues
• dependencies

Run loops and messaging

• Avoid shared mutable state
• For POSIX threads and NSThreads:
• put your thread into an event loop, where it just waits
until an event occurs

• the main thread has this by default
• hidden inside UIApplicationMain

• then you can communicate with the thread through:
• -[NSObject performSelector:onThread:withObject:waitUntilDone:]

Run loop API

• +[NSRunLoop currentRunLoop]
• -[NSRunLoop run]
• you have to add at least one input source or it will return
immediately

• but you can add an empty port
• [NSMachPort port]

• -[NSRunLoop addPort:forMode:]

Main thread

• first thread = main thread = UI thread
• all rendering
• all layout
• scrolling, panning, zooming
• user input (touches, on-screen keyboard, external keyboard)
• system events

• Yes, that’s a lot of work.
• 60 FPS = 16 ms per frame
• Yes, that’s very little time.

Offload the main thread

• Goal: Keep the UI thread responsive
• Rule:
• Do as much work as possible on other threads

• Well, but…
• Do as little work as possible in the background,
that is just enough to keep the main thread
responsive

• Measure, measure, measure

Rendering and animations

• Your app doesn’t have access to the GPU/display
• Background process called “backboardd”
• IPC – rendering commands
• Shared memory – backing stores

• CAAnimations are transferred to backboardd and
performed without any communication with your
app

Demo 1
https://github.com/kubabrecka/mobos-ios

Part III: Synchronization, locking, memory model

Demo 2

Only trust what’s guaranteed

• The order of things isn’t guaranteed unless
someone tells you:

int a, b; // global variables

// thread 1
b = 20;
a = 10;

// thread 2
while (a != 10) { /* wait */ }
NSLog(@"%d", b); // ?

Solutions

• Avoid shared mutable state
• communicate by message passing
• design your objects as immutable
• avoid multithreading

• Synchronization
• You must always have “a plan”

• if you can’t tell which code is supposed to run in which
thread, then nobody can help you

• if you can’t tell which data can be accessed from which
thread, then nobody can help you

So what is guaranteed?

• Semantics for one thread
• “the (single-threaded) code you wrote will have the
correct result”

• For multi-threaded code, you have to obtain
guarantees by using:

• Atomic data types, volatile keyword
• Locks, semaphores, memory barriers

• For 3rd party code, generally you can’t assume
anything

Atomic types

• Which data types are atomic?
• Depends on the architecture!
• Pointers and “native” integers are usually atomic
• What does an atomic data type guarantee?
• Also depends on the architecture!
• A single read or a single write is usually atomic
• Definitely not “i++”
• OSAtomicIncrement, …
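
To make the "i++" point concrete, here is a sketch that uses C11 <stdatomic.h> as a portable stand-in for the OSAtomic functions. That substitution is an assumption of this example, not something the slides prescribe. atomic_fetch_add performs the whole read-modify-write as one indivisible step, so no increments are lost. hits and run_atomic_counter are made-up names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS 100000

static atomic_long hits;

/* Adds to the shared counter without taking any lock. */
static void *count_hits(void *unused) {
    (void)unused;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&hits, 1);   /* one indivisible read-modify-write */
    return NULL;
}

/* Runs two counting threads and returns the total. */
long run_atomic_counter(void) {
    pthread_t t1, t2;
    atomic_store(&hits, 0);
    pthread_create(&t1, NULL, count_hits, NULL);
    pthread_create(&t2, NULL, count_hits, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&hits);
}
```

Replacing atomic_fetch_add with a plain hits = hits + 1 on a non-atomic long would bring the lost-update problem back.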

Objective-C atomic properties

• @property (atomic) int a;
• Only affects auto-generated getters and setters
• Again, a single read is atomic, a single write is
atomic

• Again, “obj.a++” is not atomic
• It has no effect on direct member access, obj->a
• “atomic” is default

Objective-C messaging

• Is the order of Obj-C method calls guaranteed?
• It seems so, the current compilers don’t optimize
through the dynamic dispatch (objc_msgSend)

• But it’s still not guaranteed
• This might (and probably will) change in the future

Volatile keyword

• don’t confuse with Java volatile
• prevents some compiler optimizations
• the variable can change on its own

• doesn’t give you atomicity
• doesn’t give you ordering
• there are better means of synchronization

Locks

• Mutexes, critical sections
• allow only a single thread to be in this part of code at the
same time

• -[NSLock lock]
• -[NSLock unlock]
• @synchronized (object) { … }
• uses an implicit lock, which exists on each object
• handles exceptions

• Recursive locks, R/W locks, conditions

Lock-free algorithms and data structures

• Some concurrent structures (hash tables, queues)
can be written without using explicit locks

• Currently a major topic in CS
• databases

• The name is confusing though, there is still a lot of
locking happening

• cache coherency
• memory bus locking for complex atomic operations

Memory barriers

• Locks can be expensive
• A memory barrier ensures ordering without locking
• Memory reads and writes cannot be reordered across
the barrier: accesses issued before it complete before
accesses issued after it

• But the guarantee is only at the point of the barrier!

• OSMemoryBarrier
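
A sketch of barrier-based ordering for the a/b example from "Only trust what’s guaranteed", using C11 atomic_thread_fence instead of OSMemoryBarrier (an assumption made for portability). The variables payload and flag play the roles of b and a. The spin loop is for illustration only.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* plain data, like b in the earlier example */
static atomic_int flag;    /* like a: tells the reader the payload is ready */

static void *writer(void *unused) {
    (void)unused;
    payload = 20;
    atomic_thread_fence(memory_order_release);   /* payload write stays before the flag write */
    atomic_store_explicit(&flag, 10, memory_order_relaxed);
    return NULL;
}

/* Waits for the flag, then reads the payload; returns what it saw. */
int read_after_barrier(void) {
    pthread_t thread;
    int seen;
    payload = 0;
    atomic_store(&flag, 0);
    pthread_create(&thread, NULL, writer, NULL);

    while (atomic_load_explicit(&flag, memory_order_relaxed) != 10)
        ;   /* spin until the writer sets the flag */
    atomic_thread_fence(memory_order_acquire);   /* payload read stays after the flag read */
    seen = payload;

    pthread_join(thread, NULL);
    return seen;
}
```

Both fences are required: the release fence orders the writer's stores, the acquire fence orders the reader's loads. Dropping either one removes the guarantee.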

Is the trouble worth it?

• Measure!
• OK, so you need more than a single thread
• use task-level parallelization (GCD) with clear input and
output, use immutable data and message passing

• Measure again!
• OK, so you need more than that
• find the bottleneck, don’t assume
• is it really the CPU? Isn’t the bottleneck in the memory/
network/disk?

Demo 3

Part IV: Performance tuning, ILP

Multithreading isn’t everything

• There are plenty of ways to make your code run
faster

• avoiding unnecessary work
• choosing better algorithms
• calculations on the GPU
• using vector instructions (AVX, SSE, NEON)
• hand-optimizing your assembly
• tweaking the compiler optimizations

The bottleneck

• It’s easy to make wrong assumptions
• Your bottleneck can be
• CPU
• Memory
• I/O (disk, network)
• GPU

• There is no “usually”

Some common UI issues
• Creating UIViews is slow
• reuse views, dequeue cells in tables

• Loading images is slow
• cache images

• Rendering is slow
• avoid drawRect, consider rasterization of flattened views

• Scrolling is slow
• don’t do heavy work in scrollViewDidScroll

• Rendering shadows is slow
• use shadowPath

• Rendering layer masks is slow
• pre-render

Choose your data structures

• -[NSArray containsObject:]
• O(n)

• -[NSSet containsObject:]
• O(1)
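
The same contrast in plain C, as a rough sketch: a linear scan versus a lookup in a tiny open-addressing hash set. This is not how NSSet is implemented. It assumes non-negative keys and far fewer items than SLOTS (a full table would make the probe loops spin forever). All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 1024          /* power of two, must stay well above the item count */
#define EMPTY (-1)          /* assumption: stored keys are non-negative */

typedef struct { int slots[SLOTS]; } int_set;

void set_init(int_set *s) {
    for (size_t i = 0; i < SLOTS; i++) s->slots[i] = EMPTY;
}

/* Maps a key to its starting slot (multiplicative hash, masked to the table size). */
static size_t slot_for(int key) {
    return ((size_t)key * 2654435761u) & (SLOTS - 1);
}

void set_add(int_set *s, int key) {
    size_t i = slot_for(key);
    while (s->slots[i] != EMPTY && s->slots[i] != key)
        i = (i + 1) & (SLOTS - 1);      /* linear probing */
    s->slots[i] = key;
}

/* Expected O(1): jumps to the hashed slot and probes a few neighbours. */
bool set_contains(const int_set *s, int key) {
    size_t i = slot_for(key);
    while (s->slots[i] != EMPTY) {
        if (s->slots[i] == key) return true;
        i = (i + 1) & (SLOTS - 1);
    }
    return false;
}

/* O(n): scans every element until it finds a match. */
bool array_contains(const int *items, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (items[i] == key) return true;
    return false;
}
```

For a handful of items the scan may well be faster; the difference only matters as n grows, so measure.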

Always profile first

• Don’t guess, measure!
• Amdahl’s law
• Hardware is cheap, programmers are expensive
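
Amdahl’s law as a one-line function, for quick estimates before you parallelize anything. Here p is the fraction of run time that can be parallelized and n is the number of workers. The function name is made up.

```c
/* Amdahl's law: speedup with n workers when fraction p of the run time is parallelizable. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.5, two cores give a speedup of only about 1.33, and no number of cores can exceed 2.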

Profiling with Instruments
• What can you measure with Instruments?
• CPU
• utilization
• all performance counters (interrupts, syscalls, user/kernel time, …)

• Memory
• free memory
• allocations, leaks, “zombies”
• many more performance counters (page faults, cache hits/misses, …)

• Network
• Battery usage
• Display FPS
• Single process / multiple processes
• …

Measure carefully

• Instruments isn’t perfect
• Sampling is only a statistical method

• Real devices behave very differently from simulators
• Hardware is different
• Compiled code is different (both yours and libraries)

• Verify your assumptions
• In many cases, wrapping your code with two calls to
[NSDate date] and subtracting is the best approach
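
The same "two timestamps and subtract" idea in plain C, using clock_gettime with CLOCK_MONOTONIC instead of [NSDate date]. That swap is an assumption of this sketch; a monotonic clock is not affected by wall-clock adjustments. busy_work is a hypothetical stand-in for the code you want to time.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns a monotonic timestamp in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical workload standing in for the code under test. */
static unsigned long busy_work(unsigned long n) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++) sum += i * i;
    return sum;
}

/* Times one call to busy_work and returns the elapsed seconds. */
double time_busy_work(unsigned long n, unsigned long *out) {
    double start = now_seconds();
    *out = busy_work(n);   /* keep the result so the work is not optimized away */
    return now_seconds() - start;
}
```

Run the measurement several times and on a real device; a single sample tells you very little.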

Optimize memory/cache accesses

• Cache lines (64 B)
• Try to linearize memory accesses
• Choose correct data structures
• array of structs vs. struct of arrays

• Aligned memory accesses
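
A sketch of "array of structs vs. struct of arrays". Both functions sum the x field of N points, but they touch memory very differently. The type and function names are hypothetical, and which layout wins depends on your access pattern, so measure.

```c
#include <stddef.h>

#define N 1024

/* Array of structs: x, y, z of one point sit next to each other. */
typedef struct { float x, y, z; } particle_aos;

/* Struct of arrays: all x values are contiguous. */
typedef struct { float x[N], y[N], z[N]; } particles_soa;

/* Reads every third float: only a third of each fetched cache line is used. */
float sum_x_aos(const particle_aos *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += p[i].x;
    return sum;
}

/* Reads x linearly: every byte of each fetched cache line is used. */
float sum_x_soa(const particles_soa *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += p->x[i];
    return sum;
}
```

If the code usually needs x, y and z of the same point together, the array-of-structs layout is the linear one instead.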

Instruction-level parallelism

• The compiler tries to maximize ILP with scheduling
• The main obstacle is data dependency
• a series of arithmetic operations which depend on each
other simply cannot be parallelized

• independent operations are easily parallelized
• CPU is superscalar and has deep pipelines

• the problem is that often the compiler can’t be sure
about the dependency

• memory accesses, aliasing
• it has to assume the dependency is there
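
A sketch of breaking a data dependency by hand. The first function forms one long chain of additions. The second keeps four independent accumulators, so a superscalar CPU can overlap them. Function names are made up. Floating-point addition is not associative, so the two results can differ in the last bits, which is one reason a compiler will not do this on its own without relaxed math flags.

```c
#include <stddef.h>

/* One accumulator: every addition waits for the previous one. */
double sum_chain(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum;
}

/* Four accumulators: four independent chains the CPU can overlap. */
double sum_unrolled(const double *v, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) s0 += v[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```

Whether this is actually faster depends on the CPU and the compiler flags, so profile both versions.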

Help the compiler

• The compiler is smart:
• GCC: dead code elimination, common subexpression
elimination, forward propagation, loop unrolling, tail call
elimination, loop invariant motion, lower complex
arithmetic, vectorization, modulo scheduling, …

• Sometimes, it would like to be smart, but it can’t:
• the C “restrict” keyword (C99):
void * memcpy(void * restrict s1, const void * restrict s2, size_t n);
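
A small sketch of restrict on your own function (add_arrays is a hypothetical name). The qualifier is a promise from the caller that the three buffers do not overlap. If that promise is broken, the behavior is undefined.

```c
#include <stddef.h>

/* restrict promises the compiler that dst, a and b never overlap, so it may
   load several elements ahead and vectorize without re-checking memory. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Without restrict, the compiler has to assume that a store to dst[i] might change a[i + 1] or b[i + 1].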

Vector instructions

• SIMD = Single Instruction Multiple Data
• ARM NEON
• 128-bit instructions (e.g. 4x 32-bit or 16x 8-bit at once)

• LLVM auto-vectorizer
• Often you have to change your data structure
• alignment
• interleaved values
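
A sketch of hand-written NEON using the <arm_neon.h> intrinsics vld1q_f32, vaddq_f32 and vst1q_f32, which process four 32-bit floats per instruction. The function name add_f32 is made up. The #if guard is an assumption added so the same file still builds on non-ARM targets (for example the simulator), where only the scalar loop runs.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds two float arrays element by element, four lanes at a time where NEON is available. */
void add_f32(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);        /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));    /* add 4 lanes, store 4 floats */
    }
#endif
    for (; i < n; i++)                            /* scalar tail, or full loop without NEON */
        dst[i] = a[i] + b[i];
}
```

For a loop this simple the LLVM auto-vectorizer will often produce equivalent code, so check the generated assembly before writing intrinsics.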

Accelerate.framework

• Heavily optimized built-in framework for:
• image processing
• image format conversion and encoding/decoding
• DSP, FFT
• various general math on “large” data
#include <Accelerate/Accelerate.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);

Away from the CPU

• GPGPU
• Only through OpenGL ES shaders
• Perfect for image processing (Core Image, GPUImage)

• M7 motion coprocessor (iPhone 5S)
