2. Concurrency and Parallelism
■ Quite often conflated with each other
■ But the same underlying problems apply to both concurrency and parallelism
4. Process vs Thread
■ Processes have separate address spaces
■ Threads share the resources of their process
■ To the kernel, a thread is a lightweight process
■ Shared resources make the pain… real
8. Locks and synchronization
■ There is a bunch of methods for synchronising state
■ Locks / barriers / semaphores
■ Lock-free algorithms (Compare-and-Set, see the sketch below)
■ Lock-free often comes with its own performance implications
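A minimal Java sketch of the Compare-and-Set idea (AtomicInteger and compareAndSet are real JDK API; the CasCounter wrapper is purely illustrative):

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1)) {   // Compare-and-Set: no lock held
                return current + 1;
            }
            // CAS failed: another thread won the race; retry.
            // This retry loop under contention is where the performance implications come from.
        }
    }
}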
18. Memory Barriers
The L3 cache has to be synchronised
The cache coherence protocol is executed
A memory fence instruction is issued
Out-of-order (OOO) execution has to complete
Pipelines are flushed
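To connect this list to application code, here is a minimal Java sketch of where such fences come from in practice: the volatile write/read pair is the point where the JIT emits fence instructions (class and field names are illustrative):

public class Publication {
    private int payload;            // plain field: stores may be reordered or sit in store buffers
    private volatile boolean ready; // volatile field: forces the ordering described above

    void producer() {
        payload = 42;               // ordinary store
        ready = true;               // volatile store: release semantics, typically a fence on x86
    }

    void consumer() {
        if (ready) {                      // volatile load: acquire semantics
            System.out.println(payload);  // guaranteed to see 42, not a stale value
        }
    }
}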
19. Memory Model
■ Mapping virtual addresses to physical ones requires computation and/or access to the Page Directories
■ Even with the page tables in the L1 cache it would cost 16 cycles
■ There is another cache for that
■ The Translation Lookaside Buffer (TLB)
20. Some attacks use the TLB: TLBleed
■ If an address has been accessed by one thread, its translation is cached in the TLB
■ Another thread can indirectly infer this by measuring access times
21. Context switch
■ Storing/restoring context information
■ May flush the core pipelines
■ Quite expensive
■ A process context switch also invalidates the TLB
■ Thread context switches are less expensive
22. Performing the Process Switch
■ Here, we are only concerned with how the kernel
performs a process switch.
■ Essentially, every process switch consists of two steps:
■ Switching the Page Global Directory to install a new
address space.
■ Switching the Kernel Mode stack and the hardware
context, which provides all the information needed by the
kernel to execute the new process, including the CPU
registers.
■ The description of the logic and steps in "Understanding the Linux Kernel" takes 4-5 pages.
25. DB QUERY TIME IS QUITE CONSTANT
PROCESSING TIME IN THE NORMAL CASE: 1-3 MS
AFTER A CONTEXT SWITCH: MORE THAN 40 MS
26. Tracing on kernel level
■ Python VM with thread execution
■ A lot of mutex operations (the GIL effect)
■ A lot of gettimeofday() calls
■ I/O operations are optimised via mmap
27. ■ The cost of synchronisation is flushing the core pipelines
■ Thread structures are expensive in memory
■ Overhead grows in a non-linear fashion
28. Why do we need so many threads?
A lot of operations include remote calls (DB, other services)
Synchronous calls block thread execution
Classical web servers open a new thread for every incoming connection (a minimal sketch follows below)
The 10 000 connections problem
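A minimal sketch of that classical thread-per-connection model (the EchoServer class and the port are hypothetical):

import java.io.*;
import java.net.*;

public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();           // blocks until a connection arrives
                new Thread(() -> handle(client)).start();  // one OS thread per connection
            }
        }
    }

    static void handle(Socket client) {
        try (client;
             BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);                         // the thread mostly sits blocked on I/O
            }
        } catch (IOException ignored) {
        }
    }
}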
29. Code spends more time waiting than working
Usually the DB can handle more load than the applications
A common first step in scaling is to increase the number of app instances
For many workloads with blocking drivers and calls this holds true
30. Pain – pain – pain
Creating and operating threads is expensive
A thread blocked by a synchronous call can be rescheduled
More and more context switches
32. Object Oriented Programming
■ An object encapsulates state
■ Methods can change the internal state
■ The object's invariant can be broken under concurrent access (see the sketch below)
■ Semantics oriented around nouns
me.walkTo(store.open(basket.add(milk)))
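A minimal Java sketch of an invariant breaking under concurrent access (the BankAccount class and its invariant balance >= 0 are illustrative):

public class BankAccount {
    private int balance = 100;   // invariant: balance >= 0

    // Unsynchronized check-then-act: two threads can both pass the check
    // and together drive the balance negative, breaking the invariant.
    void withdraw(int amount) {
        if (balance >= amount) {
            // another thread may withdraw here, between the check and the update
            balance -= amount;
        }
    }
}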
33. Functional Programming
■ Oriented around functional composition
■ Functions have no side effects
■ Comes with some performance implications
■ Semantics oriented around verbs
walk(open(store, add(basket, milk)))
34. Lazy execution
Functional composition means building pipelines
The pipeline is defined before the actual computation runs
Declarative
More freedom for runtime optimizations (see the Streams sketch below)
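A minimal Java Streams sketch of this laziness: the pipeline below is only a description until the terminal operation runs, which also lets the runtime short-circuit (the values and printouts are illustrative):

import java.util.Optional;
import java.util.stream.Stream;

public class LazyPipeline {
    public static void main(String[] args) {
        // Composition only: nothing is computed at this point.
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
                .map(i -> { System.out.println("map " + i); return i * i; })
                .filter(i -> i > 4);

        System.out.println("pipeline defined, nothing executed yet");

        // The terminal operation triggers execution; only as many elements as needed are processed.
        Optional<Integer> first = pipeline.findFirst();
        System.out.println("first = " + first.orElse(-1));
    }
}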
38. Two sides of a coin
Functional - building software by composing pure functions, avoiding shared state, mutable data, and side-effects.
Reactive - an asynchronous programming paradigm concerned with data streams and the propagation of change.
39. So what is the main goal?
To maximise the use rate of modern multicore CPUs and, more precisely, of the threads competing for their use.
41. Pulling vs. Pushing Data
Java streams – pull model
Reactive – push model
Reacting to propagated changes instead of iterating (see the sketch below)
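A minimal sketch of the two models side by side, assuming reactor-core is on the classpath (Flux.fromIterable, map and subscribe are real Reactor operators; the class name is illustrative):

import java.util.List;
import reactor.core.publisher.Flux;

public class PullVsPush {
    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3);

        // Pull model: the consumer drives iteration, asking for the next element.
        data.stream()
            .map(i -> i * 10)
            .forEach(i -> System.out.println("pulled " + i));

        // Push model: the producer propagates each value to the subscriber's callback.
        Flux.fromIterable(data)
            .map(i -> i * 10)
            .subscribe(i -> System.out.println("pushed " + i));
    }
}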
42. Blocking Processing
■ Mapping one execution path onto one thread is ineffective
■ Threads sit blocked waiting for I/O operations to complete
■ The way out is to share threads (a relatively expensive and scarce resource) among lighter constructs
■ For example, a functional composition of the execution path (see the sketch below)
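A minimal Java sketch of such a composition using CompletableFuture (loadUser and loadOrders are hypothetical stand-ins for remote calls):

import java.util.concurrent.CompletableFuture;

public class ComposedPath {
    public static void main(String[] args) {
        // The execution path is composed as functions over futures; no thread sits blocked
        // between the stages, the shared pool only runs them when results become available.
        CompletableFuture<String> response =
                CompletableFuture.supplyAsync(ComposedPath::loadUser)
                        .thenCompose(user -> CompletableFuture.supplyAsync(() -> loadOrders(user)))
                        .thenApply(orders -> "rendered: " + orders);

        System.out.println(response.join());   // join only because this demo has nothing else to do
    }

    static String loadUser()              { return "user-42"; }
    static String loadOrders(String user) { return "orders of " + user; }
}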
44. Let’s test
■ 2514 units of work (DB request and computation)
■ Scheduled once a minute
■ Execution on an 8-thread pool with a blocking driver takes 30-35 sec on a dedicated server
■ Let's map it onto threads and onto a reactive pool (a rough sketch of the reactive variant follows)
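This is not the author's actual code, just a rough sketch of the reactive variant, assuming reactor-core: each unit of work is wrapped in a Mono and at most 8 are in flight at once (doUnitOfWork stands in for the DB request and computation):

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class WorkUnits {
    public static void main(String[] args) {
        Flux.range(0, 2514)
            .flatMap(i -> Mono.fromCallable(() -> doUnitOfWork(i))        // blocking DB call + computation
                              .subscribeOn(Schedulers.boundedElastic()),  // run it on a bounded worker pool
                     8)                                                   // at most 8 units in flight
            .blockLast();                                                 // block only in this demo's main
    }

    static int doUnitOfWork(int i) {
        return i * i;   // stand-in for the real work
    }
}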
45. More than 2553 threads at start
CPU and memory disturbances
A lot of 5-second timeouts on the driver side
46. 8 threads with non-blocking I/O
Less CPU and memory usage
Uniform, scheduled execution
47. One thread – one execution path
One thread – many execution paths
48. DB LOAD – ONE REACTIVE INSTANCE CAN LOAD UP THE DB
49. Reactive is more than just async and non-blocking execution
Advanced time-based scheduling
Flexible scheduling
Backpressure control (see the sketch below)
Resilience to errors
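As one example, here is what backpressure control can look like in Reactor, assuming reactor-core (limitRate is a real operator; the numbers are illustrative):

import reactor.core.publisher.Flux;

public class BackpressureDemo {
    public static void main(String[] args) {
        // limitRate caps how many elements the subscriber requests from upstream at a time,
        // so a fast producer cannot overwhelm a slow consumer.
        Flux.range(1, 1_000)
            .limitRate(100)                      // request 100, then refill in batches
            .map(i -> i * 2)
            .subscribe(i -> System.out.println("consumed " + i));
    }
}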
51. Recommendations
• Understanding the Linux Kernel [Book] - O'Reilly Media
• Optimizing Java - O'Reilly Media
• Seven Concurrency Models in Seven Weeks - The Pragmatic Bookshelf
• Learning Haskell
Editor's Notes
Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores.
Memory Ordering Buffers (MOB): The MOB comprises a 64-entry load and a 36-entry store buffer. These buffers are used to track in-flight operations while waiting on the cache sub-system as instructions get executed out-of-order. The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache. These buffers enable our fast processors to run without blocking while data is transferred to and from the cache sub-system. When the processor issues reads and writes they can come back out-of-order. The MOB is used to disambiguate the load and store ordering for compliance with the published memory model.
Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches. Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache.
Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache.
Level 3 Cache: The L3 cache is shared across all cores within a socket.
Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket-local access on a full cache-miss. This is, however, extremely variable, being much less for subsequent accesses to columns in the same row buffer.
NUMA: In a multi-socket server we have non-uniform memory access. It is non-uniform because the required memory may be on a remote socket, requiring an additional 40ns hop across the QPI bus.
Associativity Levels
Caches are effectively hardware based hash tables. The hash function is usually a simple masking of some low-order bits for cache indexing. Hash tables need some means to handle a collision for the same slot.
The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3. This provides for rapid identification of the core containing a modified line when snooping for changes. The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.
Cache Coherence
With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory. The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory.
To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states. The protocol Intel employs for this is MESIF; AMD employs a variant known as MOESI. Under the MESIF protocol each cache-line can be in one of the following 5 states:
Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage. When written back to main-memory the state transitions to Exclusive.
Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory. When written to, the state then transitions to Modified. To achieve this state a Read-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.
Shared: Indicates a clean copy of a cache-line that matches main-memory.
Invalid: Indicates an unused cache-line.
Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.
Translation Lookaside Buffers (TLB)
Besides general-purpose hardware caches, 80 × 86 processors include another cache called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.
Without the TLB, all virtual address lookups would take 16 cycles, even if the page table was held in the L1 cache. Performance would be unacceptable, so the TLB is basically essential for all modern chips.
TLBleed shows that, by monitoring hyper-thread activity through the TLB instead of caches, even with full cache isolation or protection policies in effect, information can still leak between processes
A context switch is the process by which the OS scheduler removes a currently running thread or task and replaces it with one that is waiting. There are several different types of context switch, but broadly speaking, they all involve swapping the executing instructions and the stack state of the thread.
A context switch can be a costly operation, whether between user threads or from user mode into kernel mode (sometimes called a mode switch). The latter case is particularly important, because a user thread may need to swap into kernel mode in order to perform some function partway through its time slice. However, this switch will force instruction and other caches to be emptied, as the memory areas accessed by the user space code will not normally have anything in common with the kernel.
For each process, Linux packs two different data structures in a single per-process memory area: a small data structure linked to the process descriptor, namely the thread_info structure, and the Kernel Mode process stack.
A context switch into kernel mode will invalidate the TLBs and potentially other caches. When the call returns, these caches will have to be refilled, and so the effect of a kernel mode switch persists even after control has returned to user space. This causes the true cost of a system call to be masked.
Runtime optimisations threading
The main feature of reactive programming for application-level components is that it allows tasks to be executed asynchronously. Processing streams of events in an asynchronous and nonblocking way is essential for maximizing the use rate of modern multicore CPUs and, more precisely, of the threads competing for their use.
In non-blocking or asynchronous request processing, no thread is in a waiting state. There is generally only one request thread receiving requests.
All incoming requests come with an event handler and callback information. The request thread delegates each incoming request to a thread pool (generally a small number of threads), which hands it to its handler function, while the request thread immediately moves on to the next incoming request.
When the handler function completes, one of the threads from the pool collects the response and passes it to the callback function.
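A minimal Java sketch of that delegation (AsyncDispatcher, the pool size and handle() are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class AsyncDispatcher {
    // Small worker pool: the request thread never waits, it only hands work over.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // The request thread delegates the request together with its callback and returns immediately.
    public void dispatch(String request, Consumer<String> callback) {
        workers.submit(() -> {
            String response = handle(request);   // the handler function runs on a pool thread
            callback.accept(response);           // completion is pushed back via the callback
        });
    }

    private String handle(String request) {
        return "handled " + request;
    }
}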