4. Process vs Thread
Processes have separate address spaces
Threads share the resources of their process
To the kernel, a thread is a lightweight process grouped under a thread_group
Shared resources are a source of pain
8. Locks and synchronisation
There are a number of methods for synchronising state
Locks / barriers / semaphores
Lock-free algorithms (Compare-and-Set), see the sketch below
Lock-free still has performance implications
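As a minimal illustration of the lock-free approach (the talk does not tie this to any particular language, so Java's AtomicInteger is used here only as an assumed example), a compare-and-set loop retries until it wins the race instead of blocking on a lock:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: read the current value and try to publish
    // current + 1 with a single compare-and-set; retry if another
    // thread changed the value in between. No thread ever blocks.
    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}

Under heavy contention the retry loop still burns CPU, which is the performance implication mentioned above.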
17. Memory Fence Barriers
The L3 cache has to be synchronised
The cache coherence protocol is executed
A memory fence instruction is issued (sketch below)
All reordering in the core pipelines must complete
The pipelines are flushed
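A minimal sketch of where an explicit fence fits, assuming a Java 9+ environment and VarHandle.fullFence() (idiomatic code would normally rely on volatile or java.util.concurrent rather than raw fences; this is illustrative only):

import java.lang.invoke.VarHandle;

public class FencedFlag {
    private int payload;    // plain field, no ordering guarantees on its own
    private boolean ready;  // plain flag

    // Writer: publish the payload, then emit a full fence so the store
    // to 'payload' cannot be reordered after the store to 'ready'.
    public void publish(int value) {
        payload = value;
        VarHandle.fullFence();
        ready = true;
    }

    // Reader: fence between loading the flag and loading the payload,
    // so the two loads are not reordered by the core.
    public Integer tryRead() {
        if (ready) {
            VarHandle.fullFence();
            return payload;
        }
        return null;
    }
}

The fence is what forces the store buffer to drain and surrounding loads/stores to complete in order, which is exactly the pipeline-flushing cost listed above.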
18. Memory Model
Mapping virtual addresses to physical ones requires computation and/or access to the Page Directories
There is another cache for that: the Translation Lookaside Buffer (TLB)
19. Some attacks use this: TLBleed
If an address has been accessed by some thread, it is cached in the TLB
Another thread can measure indirect access times
20. Context switch
Storing/restoring context information
Can flush the core pipelines
Quite expensive (a rough measurement sketch follows)
A process context switch also invalidates the TLB buffers
Thread context switches are less expensive
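One rough, illustrative way to feel this cost, assuming Java (the class name and round count are arbitrary): two threads hand a token back and forth through SynchronousQueues, so every round forces at least two thread switches.

import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws Exception {
        final int rounds = 100_000;
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();

        Thread other = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    pong.put(ping.take());   // echo every message back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        other.start();

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);
            pong.take();
        }
        long elapsed = System.nanoTime() - start;
        other.join();
        // Each round involves at least two handoffs between the threads.
        System.out.printf("~%d ns per handoff%n", elapsed / (rounds * 2L));
    }
}

The reported figure also includes queue and locking overhead, so treat it as an indication rather than a precise context-switch cost.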
21. Performing the Process Switch
Here, we are only concerned with how the kernel performs a process switch.
Essentially, every process switch consists of two steps:
Switching the Page Global Directory to install a new address space.
Switching the Kernel Mode stack and the hardware context, which provides all the information
needed by the kernel to execute the new process, including the CPU registers.
The description of the logic and the steps in "Understanding the Linux Kernel" takes 4-5 pages.
23. Tracing Info
Average processing time: 40-60 ms
With an increased number of process switches it rises to 500-1000 ms
A DB query takes 40-50 ms
24. DB query time is quite constant
Processing time in the normal case (CPU/memory-access intensive): 1-3 ms
After a context switch: more than 40 ms
25. Tracing on kernel level
Python VM with thread execution
A lot of mutex operations (the GIL effect)
A lot of gettimeofday() calls
I/O operations are optimised via mmap
26. Summary
The cost of synchronisation is core pipeline flushing
Thread structures are expensive in memory
Overhead grows in a non-linear fashion
The 10 000 connections problem
27. Why do we need so many threads?
A lot of operations include remote calls (DB, other services)
Synchronous calls block thread execution
Classical web servers open a new thread for every incoming connection (see the sketch below)
The 10 000 connections problem
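A sketch of the classical thread-per-connection model, assuming Java sockets (the port and handler are placeholders): each accepted connection gets its own thread, so 10 000 concurrent connections mean 10 000 mostly idle threads.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                // One dedicated thread per connection: most of these
                // threads spend their time blocked on I/O or remote calls.
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        try (client) {
            // Placeholder handler: a real one would read the request and
            // perform blocking remote calls (DB, other services).
            client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".getBytes());
        } catch (IOException ignored) {
        }
    }
}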
28. Code spends more time waiting than working
Usually the DB can handle more load than the applications
A common first step in scaling is to increase the number of app instances
For a lot of operations with blocking drivers and calls this is true
29. Pain – pain – pain
Creating threads is expensive
Operating threads is expensive
A thread blocked by a synchronous call can be rescheduled
More and more context switches
31. Avoiding Mutable State
An object encapsulates state
Methods can change its internal state
The object invariant can be broken in case of concurrent access (see the sketch below)
Semantics oriented on nouns
me.buy(store.open(basket.add(milk)))
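A small sketch, assuming Java and a hypothetical Basket class (not from the original slides), of how an invariant can break under concurrent access, together with an immutable variant that avoids shared mutable state:

public class Basket {
    private static final int PRICE = 5;
    private int items = 0;
    private int totalPrice = 0;   // invariant: totalPrice == items * PRICE

    // Not synchronized: two threads calling add() concurrently can
    // interleave the two updates, leaving items and totalPrice out of
    // sync and breaking the invariant.
    public void add() {
        items = items + 1;
        totalPrice = totalPrice + PRICE;
    }
}

// An immutable variant side-steps the problem: every "change" returns
// a new object, so there is no shared mutable state to corrupt.
final class ImmutableBasket {
    private static final int PRICE = 5;
    private final int items;
    private final int totalPrice;

    ImmutableBasket(int items, int totalPrice) {
        this.items = items;
        this.totalPrice = totalPrice;
    }

    ImmutableBasket add() {
        return new ImmutableBasket(items + 1, totalPrice + PRICE);
    }
}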
Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores.
Memory Ordering Buffers (MOB): The MOB comprises a 64-entry load and 36-entry store buffer. These buffers are used to track in-flight operations while waiting on the cache sub-system as instructions get executed out-of-order. The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache. These buffers enable our fast processors to run without blocking while data is transferred to and from the cache sub-system. When the processor issues reads and writes they can come back out-of-order. The MOB is used to disambiguate the load and store ordering for compliance to the published memory model.
Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches. Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache.
Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache.
Level 3 Cache: The L3 cache is shared across all cores within a socket.
Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss. This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer.
NUMA: In a multi-socket server we have non-uniform memory access. It is non-uniform because the required memory may be on a remote socket, incurring an additional ~40ns hop across the QPI bus.
Associativity Levels
Caches are effectively hardware based hash tables. The hash function is usually a simple masking of some low-order bits for cache indexing. Hash tables need some means to handle a collision for the same slot.
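A small sketch of that low-order-bit masking, assuming Java and illustrative cache parameters (64-byte lines, 64 sets), neither of which is taken from the text:

public class CacheIndex {
    // Illustrative parameters: roughly the shape of a 32K 8-way L1 data cache.
    private static final int LINE_SIZE = 64;
    private static final int SETS = 64;

    // The "hash function": drop the offset bits within the line, then
    // mask the low-order bits of the line address to pick a set.
    static int setIndex(long address) {
        long lineAddress = address / LINE_SIZE;   // equivalently address >>> 6
        return (int) (lineAddress & (SETS - 1));
    }

    public static void main(String[] args) {
        // Addresses exactly SETS * LINE_SIZE bytes apart land in the same
        // set, which is how associativity conflicts (collisions) arise.
        System.out.println(setIndex(0x1000));             // set 0
        System.out.println(setIndex(0x1000 + 64L * 64));  // same set
    }
}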
The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3. This provides for rapid identification of the core containing a modified line when snooping for changes. The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.
Cache Coherence
With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory. The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory.
To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states. The protocol Intel employs for this is MESIF; AMD employs a variant known as MOESI. Under the MESIF protocol each cache-line can be in 1 of the 5 following states (a toy model follows the list):
Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage. When written back to main-memory the state transitions to Exclusive.
Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory. When written to, the state then transitions to Modified. To achieve this state a Read-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.
Shared: Indicates a clean copy of a cache-line that matches main-memory.
Invalid: Indicates an unused cache-line.
Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.
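A toy encoding of these states, assuming Java; it models only the two transitions the text spells out (write-back of a dirty line, and a local write to an Exclusive line), not the full protocol:

// Toy model of the MESIF states described above; real transitions are
// driven by the cache controller and snoop traffic.
enum CacheLineState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD;

    // Writing back a dirty line: Modified -> Exclusive.
    CacheLineState afterWriteBack() {
        return this == MODIFIED ? EXCLUSIVE : this;
    }

    // A local write to an Exclusive line: Exclusive -> Modified.
    CacheLineState afterLocalWrite() {
        return this == EXCLUSIVE ? MODIFIED : this;
    }
}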
When a cache hit occurs, the cache controller behaves differently, depending on the access type. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU saves time, which is why the cache system was invented. For a write operation, the controller may implement one of two basic strategies called write-through and write-back. In a write-through, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back, which offers more immediate efficiency, only the cache line is updated and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.
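A toy sketch of the two write strategies on a cache hit, assuming Java; 'memory', 'cacheLine' and 'dirty' are simplified stand-ins for RAM, a cached line, and the dirty bit, not a real cache model:

class WriteStrategies {
    int[] memory = new int[1024];
    int cachedAddress = 0;
    int cacheLine;      // cached copy of memory[cachedAddress]
    boolean dirty;      // set by write-back, cleared on flush

    // Write-through: update the cache line and RAM on every write.
    void writeThrough(int value) {
        cacheLine = value;
        memory[cachedAddress] = value;
    }

    // Write-back: update only the cache line and mark it dirty;
    // RAM is updated later, when the line is flushed or evicted.
    void writeBack(int value) {
        cacheLine = value;
        dirty = true;
    }

    void flush() {
        if (dirty) {
            memory[cachedAddress] = cacheLine;
            dirty = false;
        }
    }
}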
Translation Lookaside Buffers (TLB)
Besides general-purpose hardware caches, 80 × 86 processors include another cache called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.
In a multiprocessor system, each CPU has its own TLB, called the local TLB of the CPU. Contrary to the hardware cache, the corresponding entries of the TLB need not be synchronized, because processes running on the existing CPUs may associate the same linear address with different physical ones.
When the cr3 control register of a CPU is modified, the hardware automatically invalidates all entries of the local TLB, because a new set of page tables is in use and the TLBs are pointing to old data.
TLBleed shows that, by monitoring hyper-thread activity through the TLB instead of caches, even with full cache isolation or protection policies in effect, information can still leak between processes
A context switch is the process by which the OS scheduler removes a currently running thread or task and replaces it with one that is waiting. There are several different types of context switch, but broadly speaking, they all involve swapping the executing instructions and the stack state of the thread.
A context switch can be a costly operation, whether between user threads or from user mode into kernel mode (sometimes called a mode switch). The latter case is particularly important, because a user thread may need to swap into kernel mode in order to perform some function partway through its time slice. However, this switch will force instruction and other caches to be emptied, as the memory areas accessed by the user space code will not normally have anything in common with the kernel.
For each process, Linux packs two different
data structures in a single per-process memory area: a small data structure linked to the process descriptor, namely the thread_info structure, and the Kernel Mode process stack.
A context switch into kernel mode will invalidate the TLBs and potentially other caches. When the call returns, these caches will have to be refilled, and so the effect of a kernel mode switch persists even after control has returned to user space. This masks the true cost of a system call.
In non-blocking or asynchronous request processing, no thread is left in a waiting state. There is generally only one request thread receiving the requests.
All incoming requests come with an event handler and callback information. The request thread delegates incoming requests to a thread pool (generally a small number of threads), which passes each request to its handler function, while the request thread immediately continues processing other incoming requests.
When the handler function completes, one of the threads from the pool collects the response and passes it to the callback function.
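A minimal sketch of that flow, assuming Java; the class, pool size, and handler are placeholders, not part of the original description:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class AsyncRequestProcessing {
    // A small worker pool; the single request thread never blocks on it.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // The request thread calls this and immediately returns to accept
    // more requests; the callback is invoked by a pool thread when the
    // handler has produced a response.
    static void submit(String request, Consumer<String> callback) {
        pool.execute(() -> {
            String response = handle(request);   // possibly slow work
            callback.accept(response);
        });
    }

    private static String handle(String request) {
        return "response to " + request;         // placeholder handler
    }

    public static void main(String[] args) {
        submit("GET /orders", resp -> System.out.println(resp));
        pool.shutdown();
    }
}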