4. Process vs Thread
Processes have separate address spaces
Threads share the resources of their process
To the kernel, a thread is a lightweight process grouped under a thread_group
Shared resources are a source of pain
8. Locks and synchronisation
There are a number of methods for synchronising state
Locks / barriers / semaphores
Lock-free algorithms (Compare-and-Set), see the sketch below
Lock-free still has performance implications
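As a minimal illustration of the lock-free approach (the talk does not tie this to any particular language, so Java's AtomicInteger is used here only as an assumed example), a compare-and-set loop retries until it wins the race instead of blocking on a lock:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: read the current value and try to publish
    // current + 1 with a single compare-and-set; retry if another
    // thread changed the value in between. No thread ever blocks.
    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}

Under heavy contention the retry loop still burns CPU, which is the performance implication mentioned above.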
17. Memory Fence Barriers
The L3 cache has to be synchronised
The cache coherence protocol is executed
A memory fence instruction is issued (sketch below)
All reordering in the core pipelines must complete
The pipelines are flushed
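A minimal sketch of where an explicit fence fits, assuming a Java 9+ environment and VarHandle.fullFence() (idiomatic code would normally rely on volatile or java.util.concurrent rather than raw fences; this is illustrative only):

import java.lang.invoke.VarHandle;

public class FencedFlag {
    private int payload;    // plain field, no ordering guarantees on its own
    private boolean ready;  // plain flag

    // Writer: publish the payload, then emit a full fence so the store
    // to 'payload' cannot be reordered after the store to 'ready'.
    public void publish(int value) {
        payload = value;
        VarHandle.fullFence();
        ready = true;
    }

    // Reader: fence between loading the flag and loading the payload,
    // so the two loads are not reordered by the core.
    public Integer tryRead() {
        if (ready) {
            VarHandle.fullFence();
            return payload;
        }
        return null;
    }
}

The fence is what forces the store buffer to drain and surrounding loads/stores to complete in order, which is exactly the pipeline-flushing cost listed above.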
18. Memory Model
Mapping virtual addresses to physical ones requires computation and/or access to the Page Directories
There is another cache for that: the Translation Lookaside Buffer (TLB)
19. Some attacks use this: TLBleed
If an address has been accessed by some thread, it is cached in the TLB
Another thread can measure indirect access times
20. Context switch
Storing/restoring context information
Can flush the core pipelines
Quite expensive (a rough measurement sketch follows)
A process context switch also invalidates the TLB buffers
Thread context switches are less expensive
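One rough, illustrative way to feel this cost, assuming Java (the class name and round count are arbitrary): two threads hand a token back and forth through SynchronousQueues, so every round forces at least two thread switches.

import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws Exception {
        final int rounds = 100_000;
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();

        Thread other = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    pong.put(ping.take());   // echo every message back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        other.start();

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);
            pong.take();
        }
        long elapsed = System.nanoTime() - start;
        other.join();
        // Each round involves at least two handoffs between the threads.
        System.out.printf("~%d ns per handoff%n", elapsed / (rounds * 2L));
    }
}

The reported figure also includes queue and locking overhead, so treat it as an indication rather than a precise context-switch cost.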
21. Performing the Process Switch
Here, we are only concerned with how the kernel performs a process switch.
Essentially, every process switch consists of two steps:
Switching the Page Global Directory to install a new address space.
Switching the Kernel Mode stack and the hardware context, which provides all the information
needed by the kernel to execute the new process, including the CPU registers.
The description of the logic and the steps in "Understanding the Linux Kernel" takes 4-5 pages.
23. Tracing Info
Average processing time: 40-60 ms
With an increased number of process switches it rises to 500-1000 ms
A DB query takes 40-50 ms
24. DB query time is quite constant
Processing time in the normal case (CPU/memory-access intensive): 1-3 ms
After a context switch: more than 40 ms
25. Tracing on kernel level
Python VM with thread execution
A lot of mutex operations (the GIL effect)
A lot of gettimeofday() calls
I/O operations are optimised via mmap
26. Summary
The cost of synchronisation is core pipeline flushing
Thread structures are expensive in memory
Overhead grows in a non-linear fashion
The 10 000 connections problem
27. Why do we need so many threads?
A lot of operations include remote calls (DB, other services)
Synchronous calls block thread execution
Classical web servers open a new thread for every incoming connection (see the sketch below)
The 10 000 connections problem
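A sketch of the classical thread-per-connection model, assuming Java sockets (the port and handler are placeholders): each accepted connection gets its own thread, so 10 000 concurrent connections mean 10 000 mostly idle threads.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                // One dedicated thread per connection: most of these
                // threads spend their time blocked on I/O or remote calls.
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        try (client) {
            // Placeholder handler: a real one would read the request and
            // perform blocking remote calls (DB, other services).
            client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".getBytes());
        } catch (IOException ignored) {
        }
    }
}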
28. Code spends more time waiting than working
Usually the DB can handle more load than the applications
A common first step in scaling is to increase the number of app instances
For a lot of operations with blocking drivers and calls this is true
29. Pain – pain – pain
Creating threads is expensive
Operating threads is expensive
A thread blocked by a synchronous call can be rescheduled
More and more context switches
31. Avoiding Mutable State
An object encapsulates state
Methods can change its internal state
The object invariant can be broken in case of concurrent access (see the sketch below)
Semantics oriented on nouns
me.buy(store.open(basket.add(milk)))
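A small sketch, assuming Java and a hypothetical Basket class (not from the original slides), of how an invariant can break under concurrent access, together with an immutable variant that avoids shared mutable state:

public class Basket {
    private static final int PRICE = 5;
    private int items = 0;
    private int totalPrice = 0;   // invariant: totalPrice == items * PRICE

    // Not synchronized: two threads calling add() concurrently can
    // interleave the two updates, leaving items and totalPrice out of
    // sync and breaking the invariant.
    public void add() {
        items = items + 1;
        totalPrice = totalPrice + PRICE;
    }
}

// An immutable variant side-steps the problem: every "change" returns
// a new object, so there is no shared mutable state to corrupt.
final class ImmutableBasket {
    private static final int PRICE = 5;
    private final int items;
    private final int totalPrice;

    ImmutableBasket(int items, int totalPrice) {
        this.items = items;
        this.totalPrice = totalPrice;
    }

    ImmutableBasket add() {
        return new ImmutableBasket(items + 1, totalPrice + PRICE);
    }
}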
Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores.
Memory Ordering Buffers (MOB): The MOB comprises a 64-entry load and 36-entry store buffer. These buffers are used to track in-flight operations while waiting on the cache sub-system as instructions get executed out-of-order. The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache. These buffers enable our fast processors to run without blocking while data is transferred to and from the cache sub-system. When the processor issues reads and writes they can come back out-of-order. The MOB is used to disambiguate the load and store ordering for compliance to the published memory model.
Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches. Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache.
Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache.
Level 3 Cache: The L3 cache is shared across all cores within a socket.
Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss. This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer.
NUMA: In a multi-socket server we have non-uniform memory access. It is non-uniform because the required memory may be on a remote socket, incurring an additional ~40ns hop across the QPI bus.
Associativity Levels
Caches are effectively hardware based hash tables. The hash function is usually a simple masking of some low-order bits for cache indexing. Hash tables need some means to handle a collision for the same slot.
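A small sketch of that low-order-bit masking, assuming Java and illustrative cache parameters (64-byte lines, 64 sets), neither of which is taken from the text:

public class CacheIndex {
    // Illustrative parameters: roughly the shape of a 32K 8-way L1 data cache.
    private static final int LINE_SIZE = 64;
    private static final int SETS = 64;

    // The "hash function": drop the offset bits within the line, then
    // mask the low-order bits of the line address to pick a set.
    static int setIndex(long address) {
        long lineAddress = address / LINE_SIZE;   // equivalently address >>> 6
        return (int) (lineAddress & (SETS - 1));
    }

    public static void main(String[] args) {
        // Addresses exactly SETS * LINE_SIZE bytes apart land in the same
        // set, which is how associativity conflicts (collisions) arise.
        System.out.println(setIndex(0x1000));             // set 0
        System.out.println(setIndex(0x1000 + 64L * 64));  // same set
    }
}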
The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3. This provides for rapid identification of the core containing a modified line when snooping for changes. The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.
Cache Coherence
With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory. The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory.
To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states. The protocol Intel employs for this is MESIF; AMD employs a variant known as MOESI. Under the MESIF protocol each cache-line can be in 1 of the 5 following states (a toy model follows the list):
Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage. When written back to main-memory the state transitions to Exclusive.
Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory. When written to, the state then transitions to Modified. To achieve this state a Read-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.
Shared: Indicates a clean copy of a cache-line that matches main-memory.
Invalid: Indicates an unused cache-line.
Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.
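A toy encoding of these states, assuming Java; it models only the two transitions the text spells out (write-back of a dirty line, and a local write to an Exclusive line), not the full protocol:

// Toy model of the MESIF states described above; real transitions are
// driven by the cache controller and snoop traffic.
enum CacheLineState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD;

    // Writing back a dirty line: Modified -> Exclusive.
    CacheLineState afterWriteBack() {
        return this == MODIFIED ? EXCLUSIVE : this;
    }

    // A local write to an Exclusive line: Exclusive -> Modified.
    CacheLineState afterLocalWrite() {
        return this == EXCLUSIVE ? MODIFIED : this;
    }
}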
When a cache hit occurs, the cache controller behaves differently, depending on the access type. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU saves time, which is why the cache system was invented. For a write operation, the controller may implement one of two basic strategies called write-through and write-back. In a write-through, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back, which offers more immediate efficiency, only the cache line is updated and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.
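A toy sketch of the two write strategies on a cache hit, assuming Java; 'memory', 'cacheLine' and 'dirty' are simplified stand-ins for RAM, a cached line, and the dirty bit, not a real cache model:

class WriteStrategies {
    int[] memory = new int[1024];
    int cachedAddress = 0;
    int cacheLine;      // cached copy of memory[cachedAddress]
    boolean dirty;      // set by write-back, cleared on flush

    // Write-through: update the cache line and RAM on every write.
    void writeThrough(int value) {
        cacheLine = value;
        memory[cachedAddress] = value;
    }

    // Write-back: update only the cache line and mark it dirty;
    // RAM is updated later, when the line is flushed or evicted.
    void writeBack(int value) {
        cacheLine = value;
        dirty = true;
    }

    void flush() {
        if (dirty) {
            memory[cachedAddress] = cacheLine;
            dirty = false;
        }
    }
}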
Translation Lookaside Buffers (TLB)
Besides general-purpose hardware caches, 80 × 86 processors include another cache called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated.
In a multiprocessor system, each CPU has its own TLB, called the local TLB of the CPU. Contrary to the hardware cache, the corresponding entries of the TLB need not be synchronized, because processes running on the existing CPUs may associate the same linear address with different physical ones.
When the cr3 control register of a CPU is modified, the hardware automatically invalidates all entries of the local TLB, because a new set of page tables is in use and the TLBs are pointing to old data.
TLBleed shows that, by monitoring hyper-thread activity through the TLB instead of caches, even with full cache isolation or protection policies in effect, information can still leak between processes
A context switch is the process by which the OS scheduler removes a currently running thread or task and replaces it with one that is waiting. There are several different types of context switch, but broadly speaking, they all involve swapping the executing instructions and the stack state of the thread.
A context switch can be a costly operation, whether between user threads or from user mode into kernel mode (sometimes called a mode switch). The latter case is particularly important, because a user thread may need to swap into kernel mode in order to perform some function partway through its time slice. However, this switch will force instruction and other caches to be emptied, as the memory areas accessed by the user space code will not normally have anything in common with the kernel.
For each process, Linux packs two different
data structures in a single per-process memory area: a small data structure linked to the process descriptor, namely the thread_info structure, and the Kernel Mode process stack.
A context switch into kernel mode will invalidate the TLBs and potentially other caches. When the call returns, these caches will have to be refilled, and so the effect of a kernel mode switch persists even after control has returned to user space. This masks the true cost of a system call.
In non-blocking or asynchronous request processing, no thread is left in a waiting state. There is generally only one request thread receiving the requests.
All incoming requests come with an event handler and callback information. The request thread delegates incoming requests to a thread pool (generally a small number of threads), which passes each request to its handler function, while the request thread immediately continues processing other incoming requests.
When the handler function completes, one of the threads from the pool collects the response and passes it to the callback function.
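A minimal sketch of that flow, assuming Java; the class, pool size, and handler are placeholders, not part of the original description:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class AsyncRequestProcessing {
    // A small worker pool; the single request thread never blocks on it.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // The request thread calls this and immediately returns to accept
    // more requests; the callback is invoked by a pool thread when the
    // handler has produced a response.
    static void submit(String request, Consumer<String> callback) {
        pool.execute(() -> {
            String response = handle(request);   // possibly slow work
            callback.accept(response);
        });
    }

    private static String handle(String request) {
        return "response to " + request;         // placeholder handler
    }

    public static void main(String[] args) {
        submit("GET /orders", resp -> System.out.println(resp));
        pool.shutdown();
    }
}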