SlideShare a Scribd company logo

Understanding of linux kernel memory model

Explain importance of memory model for parallel programming and describes linux kernel memory model.

1 of 73
Download to read offline
Understanding of
Linux Kernel Memory Model
SeongJae Park <sj38.park@gmail.com>
Great To Meet You
● SeongJae Park <sj38.park@gmail.com>
● Started contribution to Linux kernel just for fun since 2012
● Developing Guaranteed Contiguous Memory Allocator
○ Source code is available: https://lwn.net/Articles/634486/
● Maintaining Korean translation of Linux kernel memory barrier document
○ The translation has merged into mainline since v4.9-rc1
○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1
https://www.linux.com/sites/lcom/files/styles/rendered_file/public/kernel-dev-2015.jpg?itok=9s3mV2Nc
Programmers in Multi-core Land
● Processor vendors changed their mind to increase number of cores instead of
clock speed a decade ago
○ Now, multi-core system is prevalent
○ Even octa-core portable bomb in your pocket, maybe?
http://www.gotw.ca/images/CPU.png
GIVE UP :’(
Programmers in Multi-core Land
● Processor vendors changed their mind to increase number of cores instead of
clock speed a decade ago
○ Now, multi-core system is prevalent
○ Even octa-core portable bomb in your pocket, maybe?
● As a result, the free lunch is over;
parallel programming is essential for high performance and scalability
http://www.gotw.ca/images/CPU.png
GIVE UP :’(
Writing Correct Parallel Program is Hard
● Compilers and processors are optimized for Instructions Per Cycle, not
programmer perspective goals such as response time or throughput of
meaningful (in people’s context) progress
Writing Correct Parallel Program is Hard
● Compilers and processors are optimized for Instructions Per Cycle, not
programmer perspective goals such as response time or throughput of
meaningful (in people’s context) progress
● Nature of parallelism is counter-intuitive
○ Time is relative, before and after is ambiguous, even simultaneous available
CPU 0 CPU 1
A = 1;
B = 1;
A = 2;
B = 2;
assert(B == 2 && A == 1)
CPU 1 assertion can be true on most parallel programming environments

Recommended

BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionGene Chang
 
Linux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBLinux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBshimosawa
 
Slab Allocator in Linux Kernel
Slab Allocator in Linux KernelSlab Allocator in Linux Kernel
Slab Allocator in Linux KernelAdrian Huang
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KernelThomas Graf
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in LinuxAdrian Huang
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 

More Related Content

What's hot

WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307Linaro
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet FiltersKernel TLV
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingViller Hsiao
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
What Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaWhat Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaBrendan Gregg
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisPaul V. Novarese
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...The Linux Foundation
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)shimosawa
 
XPDDS17: Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...
XPDDS17:  Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...XPDDS17:  Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...
XPDDS17: Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...The Linux Foundation
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdfAdrian Huang
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Valeriy Kravchuk
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdfAdrian Huang
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
Kdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisKdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisBuland Singh
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPFAlex Maestretti
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 

What's hot (20)

WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307
 
eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet Filters
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
eBPF maps 101
eBPF maps 101eBPF maps 101
eBPF maps 101
 
Memory model
Memory modelMemory model
Memory model
 
What Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaWhat Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versa
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
XPDDS17: Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...
XPDDS17:  Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...XPDDS17:  Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...
XPDDS17: Reworking the ARM GIC Emulation & Xen Challenges in the ARM ITS Emu...
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdf
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdf
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
Kdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisKdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysis
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 

Viewers also liked

Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory ManagementNi Zo-Ma
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in LinuxRaghu Udiyar
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golangSeongJae Park
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)SeongJae Park
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.aviSeongJae Park
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot publicSeongJae Park
 
gcma: guaranteed contiguous memory allocator
gcma:  guaranteed contiguous memory allocatorgcma:  guaranteed contiguous memory allocator
gcma: guaranteed contiguous memory allocatorSeongJae Park
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologueSeongJae Park
 
Linux memorymanagement
Linux memorymanagementLinux memorymanagement
Linux memorymanagementpradeepelinux
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.aviSeongJae Park
 
(120513) #fitalk an introduction to linux memory forensics
(120513) #fitalk   an introduction to linux memory forensics(120513) #fitalk   an introduction to linux memory forensics
(120513) #fitalk an introduction to linux memory forensicsINSIGHT FORENSIC
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_dockerSeongJae Park
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golangSeongJae Park
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internallySeongJae Park
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamalKamal Maiti
 
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingLinux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingChinaNetCloud
 
Linux memory consumption
Linux memory consumptionLinux memory consumption
Linux memory consumptionhaish
 
Process' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxProcess' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxVarun Mahajan
 

Viewers also liked (20)

Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in Linux
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golang
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.avi
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot public
 
gcma: guaranteed contiguous memory allocator
gcma:  guaranteed contiguous memory allocatorgcma:  guaranteed contiguous memory allocator
gcma: guaranteed contiguous memory allocator
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologue
 
Linux memorymanagement
Linux memorymanagementLinux memorymanagement
Linux memorymanagement
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi
 
(120513) #fitalk an introduction to linux memory forensics
(120513) #fitalk   an introduction to linux memory forensics(120513) #fitalk   an introduction to linux memory forensics
(120513) #fitalk an introduction to linux memory forensics
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_docker
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golang
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internally
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamal
 
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingLinux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
 
Linux memory consumption
Linux memory consumptionLinux memory consumption
Linux memory consumption
 
Process' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxProcess' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/Linux
 

Similar to Understanding of linux kernel memory model

An Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux KernelAn Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux KernelSeongJae Park
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Ray Jenkins
 
chapter 1 -Basic Structure of Computers.pptx
chapter 1 -Basic Structure of Computers.pptxchapter 1 -Basic Structure of Computers.pptx
chapter 1 -Basic Structure of Computers.pptxjanani603976
 
MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103Linaro
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization Ganesan Narayanasamy
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...VMware Tanzu
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCinside-BigData.com
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPA B Shinde
 
DockerCon EU '17 - Dockerizing Aurea
DockerCon EU '17 - Dockerizing AureaDockerCon EU '17 - Dockerizing Aurea
DockerCon EU '17 - Dockerizing AureaŁukasz Piątkowski
 
Compiler design notes phases of compiler
Compiler design notes phases of compilerCompiler design notes phases of compiler
Compiler design notes phases of compilerovidlivi91
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsScyllaDB
 
Linux Kernel Memory Model
Linux Kernel Memory ModelLinux Kernel Memory Model
Linux Kernel Memory ModelSeongJae Park
 
Computer Programming In C.pptx
Computer Programming In C.pptxComputer Programming In C.pptx
Computer Programming In C.pptxchouguleamruta24
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowEmanuel Di Nardo
 
On component interface
On component interfaceOn component interface
On component interfaceLaurence Chen
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...PingCAP
 

Similar to Understanding of linux kernel memory model (20)

An Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux KernelAn Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux Kernel
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
 
chapter 1 -Basic Structure of Computers.pptx
chapter 1 -Basic Structure of Computers.pptxchapter 1 -Basic Structure of Computers.pptx
chapter 1 -Basic Structure of Computers.pptx
 
MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103
 
C programming session9 -
C programming  session9 -C programming  session9 -
C programming session9 -
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
Optimizing Linux Servers
Optimizing Linux ServersOptimizing Linux Servers
Optimizing Linux Servers
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILP
 
DockerCon EU '17 - Dockerizing Aurea
DockerCon EU '17 - Dockerizing AureaDockerCon EU '17 - Dockerizing Aurea
DockerCon EU '17 - Dockerizing Aurea
 
Compiler design notes phases of compiler
Compiler design notes phases of compilerCompiler design notes phases of compiler
Compiler design notes phases of compiler
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency Investigations
 
Embedded _c_
Embedded  _c_Embedded  _c_
Embedded _c_
 
Linux Kernel Memory Model
Linux Kernel Memory ModelLinux Kernel Memory Model
Linux Kernel Memory Model
 
Computer Programming In C.pptx
Computer Programming In C.pptxComputer Programming In C.pptx
Computer Programming In C.pptx
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
On component interface
On component interfaceOn component interface
On component interface
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 

More from SeongJae Park

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in goSeongJae Park
 
GCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorGCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorSeongJae Park
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalabilitySeongJae Park
 
Brief introduction to kselftest
Brief introduction to kselftestBrief introduction to kselftest
Brief introduction to kselftestSeongJae Park
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using GolangSeongJae Park
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSSeongJae Park
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflectionSeongJae Park
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution beginSeongJae Park
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without TouchingSeongJae Park
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentationSeongJae Park
 

More from SeongJae Park (12)

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in go
 
GCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorGCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory Allocator
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalability
 
Brief introduction to kselftest
Brief introduction to kselftestBrief introduction to kselftest
Brief introduction to kselftest
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using Golang
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCS
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflection
 
ash
ashash
ash
 
Hacktime for adk
Hacktime for adkHacktime for adk
Hacktime for adk
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution begin
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without Touching
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
 

Recently uploaded

Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)Jeffrey Haguewood
 
What are the Reasons for Tracking the Attendance of the Employees?
What are the Reasons for Tracking the Attendance of the Employees?What are the Reasons for Tracking the Attendance of the Employees?
What are the Reasons for Tracking the Attendance of the Employees?NYGGS Automation Suite
 
How AI is preventing account fraud at web scale
How AI is preventing account fraud at web scaleHow AI is preventing account fraud at web scale
How AI is preventing account fraud at web scaleAmir Moghimi
 
killing camp 주차장 나누기-2 topology sort.pdf
killing camp 주차장 나누기-2 topology sort.pdfkilling camp 주차장 나누기-2 topology sort.pdf
killing camp 주차장 나누기-2 topology sort.pdfssuser82c38d
 
LLMOps with Azure Machine Learning prompt flow
LLMOps with Azure Machine Learning prompt flowLLMOps with Azure Machine Learning prompt flow
LLMOps with Azure Machine Learning prompt flowNaoki (Neo) SATO
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Role of DevOps in SaaS product Development.pdf.pptx
Role of DevOps in SaaS product Development.pdf.pptxRole of DevOps in SaaS product Development.pdf.pptx
Role of DevOps in SaaS product Development.pdf.pptxMindInventory
 
killingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfkillingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfssuser82c38d
 
killing camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfkilling camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfssuser82c38d
 
Joseph Yoder : Being Agile about Architecture
Joseph Yoder : Being Agile about ArchitectureJoseph Yoder : Being Agile about Architecture
Joseph Yoder : Being Agile about ArchitectureHironori Washizaki
 
Cybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdfCybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdfCIOWomenMagazine
 
Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)Dmitry Zinoviev
 
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이ssuser82c38d
 
Welcome to AltTask - the nexus where innovation converges with empowerment!
Welcome to AltTask - the nexus where innovation converges with empowerment!Welcome to AltTask - the nexus where innovation converges with empowerment!
Welcome to AltTask - the nexus where innovation converges with empowerment!alttaskcom
 
No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!Anthony Dahanne
 
Passbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managmentPassbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managmentThierry Gayet
 
The Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and TakeawaysThe Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and TakeawaysThousandEyes
 
Implementing Docker Containers with Windows Server 2019
Implementing Docker Containers with Windows Server 2019Implementing Docker Containers with Windows Server 2019
Implementing Docker Containers with Windows Server 2019VICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)
 
What are the Reasons for Tracking the Attendance of the Employees?
What are the Reasons for Tracking the Attendance of the Employees?What are the Reasons for Tracking the Attendance of the Employees?
What are the Reasons for Tracking the Attendance of the Employees?
 
How AI is preventing account fraud at web scale
How AI is preventing account fraud at web scaleHow AI is preventing account fraud at web scale
How AI is preventing account fraud at web scale
 
killing camp 주차장 나누기-2 topology sort.pdf
killing camp 주차장 나누기-2 topology sort.pdfkilling camp 주차장 나누기-2 topology sort.pdf
killing camp 주차장 나누기-2 topology sort.pdf
 
eLearning Content Development Company Code and Pixels.pdf
eLearning Content Development Company Code and Pixels.pdfeLearning Content Development Company Code and Pixels.pdf
eLearning Content Development Company Code and Pixels.pdf
 
LLMOps with Azure Machine Learning prompt flow
LLMOps with Azure Machine Learning prompt flowLLMOps with Azure Machine Learning prompt flow
LLMOps with Azure Machine Learning prompt flow
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Role of DevOps in SaaS product Development.pdf.pptx
Role of DevOps in SaaS product Development.pdf.pptxRole of DevOps in SaaS product Development.pdf.pptx
Role of DevOps in SaaS product Development.pdf.pptx
 
killingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfkillingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdf
 
killing camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfkilling camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdf
 
Joseph Yoder : Being Agile about Architecture
Joseph Yoder : Being Agile about ArchitectureJoseph Yoder : Being Agile about Architecture
Joseph Yoder : Being Agile about Architecture
 
Cybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdfCybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdf
 
Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)
 
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이
killingcamp 광고삽입문제 풀이, killingcamp 광고삽입문제 풀이
 
Welcome to AltTask - the nexus where innovation converges with empowerment!
Welcome to AltTask - the nexus where innovation converges with empowerment!Welcome to AltTask - the nexus where innovation converges with empowerment!
Welcome to AltTask - the nexus where innovation converges with empowerment!
 
No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!
 
Passbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managmentPassbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managment
 
The Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and TakeawaysThe Top Outages of 2023: Analyses and Takeaways
The Top Outages of 2023: Analyses and Takeaways
 
Implementing Docker Containers with Windows Server 2019
Implementing Docker Containers with Windows Server 2019Implementing Docker Containers with Windows Server 2019
Implementing Docker Containers with Windows Server 2019
 

Understanding of linux kernel memory model

  • 1. Understanding of Linux Kernel Memory Model SeongJae Park <sj38.park@gmail.com>
  • 2. Great To Meet You ● SeongJae Park <sj38.park@gmail.com> ● Started contribution to Linux kernel just for fun since 2012 ● Developing Guaranteed Contiguous Memory Allocator ○ Source code is available: https://lwn.net/Articles/634486/ ● Maintaining Korean translation of Linux kernel memory barrier document ○ The translation has merged into mainline since v4.9-rc1 ○ https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ko_KR/memory-barriers.txt?h=v4.9-rc1 https://www.linux.com/sites/lcom/files/styles/rendered_file/public/kernel-dev-2015.jpg?itok=9s3mV2Nc
  • 3. Programmers in Multi-core Land ● Processor vendors changed their mind to increase number of cores instead of clock speed a decade ago ○ Now, multi-core system is prevalent ○ Even octa-core portable bomb in your pocket, maybe? http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 4. Programmers in Multi-core Land ● Processor vendors changed their mind to increase number of cores instead of clock speed a decade ago ○ Now, multi-core system is prevalent ○ Even octa-core portable bomb in your pocket, maybe? ● As a result, the free lunch is over; parallel programming is essential for high performance and scalability http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 5. Writing Correct Parallel Program is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress
  • 6. Writing Correct Parallel Program is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress ● Nature of parallelism is counter-intuitive ○ Time is relative, before and after is ambiguous, even simultaneous available CPU 0 CPU 1 A = 1; B = 1; A = 2; B = 2; assert(B == 2 && A == 1) CPU 1 assertion can be true on most parallel programming environments
  • 7. Writing Correct Parallel Program is Hard ● Compilers and processors are optimized for Instructions Per Cycle, not programmer perspective goals such as response time or throughput of meaningful (in people’s context) progress ● Nature of parallelism is counter-intuitive ○ Time is relative, before and after is ambiguous, even simultaneous available ● C language developed with Uni-Processor assumption ■ “Et tu, C?” CPU 0 CPU 1 A = 1; B = 1; A = 2; B = 2; assert(B == 2 && A == 1) CPU 1 assertion can be true on most parallel programming environments
  • 8. TL; DR ● Memory operations can be reordered, merged, or discarded in any way unless it violates memory model defined behavior ○ In short, ``We’re all mad here’’ in parallel land ● Knowing memory model is important to write correct, fast, scalable parallel program https://ih1.redbubble.net/image.218483193.6460/sticker,220x200-pad,220x200,ffffff.u2.jpg
  • 9. Reordering for Better IPC[*] [*] IPC: Instructions Per Cycle
  • 10. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language #include <stdio.h> int main(void) { printf("hello worldn"); return 0; }
  • 11. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp Compiler
  • 12. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 13. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code ● Processor executes instruction sequence in the binary #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 14. Simple Program Execution Sequence ● Programmer writes program in C-like human readable language ● Compiler translates human readable code into assembly language ● Assembler generates executable binary from the assembly code ● Processor executes instruction sequence in the binary ○ Execution result is guaranteed to be same with sequential execution; In other words, the execution itself is not guaranteed to be sequential #include <stdio.h> int main(void) { printf("hello worldn"); return 0; } main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 3004 4000 0000 0000 ..>.....0.@..... 00000020: 4000 0000 0000 0000 d819 0000 0000 0000 @............... 00000030: 0000 0000 4000 3800 0900 4000 AssemblerCompiler
  • 15. Instruction Level Parallelism (ILP) ● Pipelining introduces instruction level parallelism ○ Each instruction is splitted up into a sequence of steps; Each step can be executed in parallel, instructions can be processed concurrently fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 2 Instruction 3
  • 16. Instruction Level Parallelism (ILP) ● Pipelining introduces instruction level parallelism ○ Each instruction is splitted up into a sequence of steps; Each step can be executed in parallel, instructions can be processed concurrently fetch decode execute fetch decode execute fetch decode execute If not pipelined, 3 cycles per instruction 3-depth pipeline can retire 3 instructions in 5 cycle: 1.7 cycles per instruction Instruction 1 Instruction 2 Instruction 3
  • 17. Dependent Instructions Harm ILP ● If an instruction is dependent to result of previous instruction, it should wait until the previous one finishes execution ○ E.g., a = b + c; d = a + b; fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 2 Instruction 3 In this case, instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction)
  • 18. Dependent Instructions Harm ILP ● If an instruction is dependent to result of previous instruction, it should wait until the previous one finishes execution ○ E.g., a = b + c; d = a + b; fetch decode execute fetch decode execute fetch decode execute In this case, instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction) 7 cycles for 3 instructions: 2.3 cycles per instruction Instruction 1 Instruction 2 Instruction 3
  • 19. Instruction Reordering Helps Performance ● By reordering dependent instructions to be located in far away, total execution time can be shorten ● If the reordering is guaranteed to not change the result of the instruction sequence, it would be helpful for better performance fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 3 Instruction 2 instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction)
  • 20. Instruction Reordering Helps Performance ● By reordering dependent instructions to be located in far away, total execution time can be shorten ● If the reordering is guaranteed to not change the result of the instruction sequence, it would be helpful for better performance fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 3 Instruction 2 instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction) By reordering instruction 2 and 3, total execution time can be shorten 6 cycles for 3 instructions: 2 cycles per instruction
  • 21. Reordering is Legal, Encouraged Behavior, But... ● If program causality is guaranteed, any reordering is legal ● Processors and compilers can make reordering of instructions for better IPC
  • 22. Reordering is Legal, Encouraged Behavior, But... ● If program causality is guaranteed, any reordering is legal ● Processors and compilers can make reordering of instructions for better IPC ● program causality defined with single processor environment ● IPC focused reordering doesn’t aware programmer perspective performance goals such as throughput or latency ● On Multi-processor system, reordering could harm not only correctness, but also performance https://s-media-cache-ak0.pinimg.com/originals/c2/1a/00/c21a007e0542f7da57dacd15a86d478d.jpg Throughput or latency? Instructions Per Cycle
  • 24. Time is Relative (E = MC2 ) ● Each CPU generates their events in their time, observes effects of events in relative time ● It is impossible to define absolute order of two concurrent events; Only relative observation order is possible CPU 1 CPU 2 CPU 3 CPU 4 Generated event 1 Generated event 2 Observed event 1 followed by event 2 I observed event 2 followed by event 1 Event Bus
  • 25. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 1 (event A) followed by an event of CPU 0 (event B) though CPU 1 observed event B before generating event A CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus
  • 26. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Event A
  • 27. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Event BEvent A
  • 28. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Event A Event B Busy… ;;;
  • 29. Relative Event Propagation of Hierarchical Memory ● Most system equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Seen Event A; Event B Event A
  • 30. Cache Coherency is Eventual ● It is well known that cache coherency protocol helps system memory consistency ● In actual, cache coherency guarantees eventual consistency only ● Every effect of each CPU will eventually become visible on all CPUs, but There’s no guarantee that they will become apparent in the same order on those other CPUs http://img06.deviantart.net/0663/i/2005/112/b/6/schrodinger__s_cat___2_by_firefoxcentral.jpg
  • 31. System with Store Buffer and Invalidation Queue ● Store Buffer and Invalidation Queue deliver effect of event but does not guarantee order of observation on each CPU CPU 0 Cache Store Buffer Invalidation Queue Memory CPU 1 Cache Store Buffer Invalidation Queue Bus
  • 33. C-language Doesn’t Know Multi-Processor ● By the time of initial C-language development, multi-processor was rare ● As a result, C-language has only few guarantees about memory operations on multi-processor ● Undefined behavior is allowed for undefined case https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming _Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Languag e,_First_Edition_Cover_(2).svg.png
  • 34. Compiler Optimizes Code ● Clever compilers try hard (really hard) to optimize code for high IPC (again, not for programmer perspective goals) ○ Converts small, private function to inline code ○ Reorder memory access code to minimize dependency ○ Simplify unnecessarily complex loops, ... ● Optimization uses term `Undefined behavior` as they want ○ It’s legal, but sometimes do insane things in programmer’s perspective ● Memory access reordering of compiler based on C-standard, which doesn’t aware multi-processor system, can generate unintended program ● Linux kernel uses compiler directives and volatile keyword to enforce memory ordering ● C11 has much more improvement, though
  • 36. Each Environment Provides Own Memory Model ● Memory Model defines how memory operations are generated, what effects it makes, how their effects will be propagated ● Each programming environment like Instruction Set Architecture, Programming language, Operating system, etc defines own memory model ○ Most modern language memory models (e.g., Golang, Rust, …) aware multi-processor http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
  • 37. Each ISA Provides Specific Memory Model ● Some architectures have stricter ordering enforcement rule than others ● PA-RISC CPUS is strictest, Alpha is weakest ● Because Linux kernel supports multiple architectures, it defines its memory model based on weakest one, Alpha https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
  • 38. Synchronization Primitives ● Though reordering and asynchronous effect propagation is legal, synchronization primitives are necessary to write human intuitive program ● Most memory model provides synchronization primitives like atomic instructions, memory barriers, etc https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
  • 39. Atomic Operations ● Atomic operations is configured with multiple sub operations ○ E.g., compare-and-swap, fetch-and-add, test-and-set ● Atomic operations have mutual exclusiveness ○ Middle state of atomic operation execution cannot be seen by others ○ It can be thought of as small critical section that protected by a global lock ● Almost every hardware supports basic atomic operations ● In general, atomic operations are expensive ○ Misuse of atomic operations can severely degrade performance and scalability http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
  • 40. Memory Barriers ● To allow synchronization of memory operations, memory model provides enforcement primitives, namely, memory barriers ● In general, memory barriers guarantee effects of memory operations issued before it to be propagated to other components (e.g., processor) in the system before memory operations issued after the barrier ● In general, memory barrier is expensive operation CPU 1 CPU 2 CPU 3 READ A; WRITE B; <BARRIER>; READ C; READ A, WRITE B, than READ C occurred WRITE B, READ A, than READ C occurred READ A and WRITE B can be reordered but READ C is guaranteed to be ordered after {READ A, WRITE B}
  • 41. Linux Kernel Memory Model ● Defined by weakest architecture, Alpha ○ Almost every combination of reordering is possible ● Provides rich set of atomic instructions ○ atomix_xchg(), atomic_inc_return(), atomic_dec_return(), ... ● Provides CPU level barriers, Compiler level barriers, semantic level barriers ○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ... ○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), … ○ Semantical barriers: ACQUIRE operations, RELEASE operations, … ○ For detail, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt ● Because different barrier has different overhead, only necessary barrier should be used in necessary case for high performance and scalability
  • 43. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1)
  • 44. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1) :)
  • 45. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 A = 1; B = 1; while (B == 1) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1) :)
  • 46. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 B = 1; A = 1; while (B == 0) {} C = 1; Z = C; X = A; assert(z == 0 || x == 1)
  • 47. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor CPU 0 CPU 1 CPU 2 B = 1; A = 1; while (B == 0) {} C = 1; X = A; Z = C; assert(z == 0 || x == 1) ?????
  • 48. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor ● Memory barrier enforces operations specified before it appear as happened to operations specified after it CPU 0 CPU 1 CPU 2 A = 1; wmb(); B = 1; while (B == 0) {} mb(); C = 1; Z = C; rmb(); X = A; assert(z == 0 || x == 1)
  • 49. Memory Operation Reordering ● Memory Operation Reordering is totally LEGAL unless it breaks causality ● Both of CPU and Compiler can do it, even in Single Processor ● Memory barrier enforces operations specified before it appear as happened to operations specified after it ● In some architecture, even Transitivity is not guaranteed ○ Transitivity: B happened after A; C happened after B; then C happened after A CPU 0 CPU 1 CPU 2 A = 1; wmb(); B = 1; while (B == 0) {} mb(); C = 1; Z = C; rmb(); X = A; assert(z == 0 || x == 1)
  • 50. Transitivity for Scheduler and Workers Scheduler and each workers made consensus about order Scheduler Worker A Worker B Worker Z ... ... What time is it now? Night! Night! Night! ...
  • 51. Transitivity between Scheduler and Worker Scheduler and each workers made consensus about order Scheduler Worker A Worker B Worker Z ... ... Yay! ... Worker Z, all workers agreed that it’s night. Do bedmaking!
  • 52. Transitivity between Scheduler and Worker Scheduler and each workers made consensus about order But, worker B and worker Z didn’t made consensus Scheduler Worker A Worker B Worker Z ... ... !!?? ... Worker Z, I’m in afternoon! I didn’t tell you it’s night!
  • 54. Compiler Reordering Avoidance ● Compiler can remove loop entirely C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; i < 1000; i++) the_var++; } loop: .LFB106: .cfi_startproc addl $1000, the_var(%rip) ret .cfi_endproc .LFE106:
  • 55. Compiler Reordering Avoidance ● ACCESS_ONCE() is a compiler memory barrier implementation of Linux kernel ● Store to the_var could not be seen by others C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... movl the_var(%rip), %ecx .L175: ... addl $1, %eax ... cmpl $999, %edx jle .L175 movl %esi, the_var(%rip) .L170: rep ret
  • 56. Compiler Reordering Avoidance ● Still, store to `the_var` not issued for every iteration C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... movl the_var(%rip), %ecx .L175: ... addl $1, %eax ... cmpl $999, %edx jle .L175 movl %esi, the_var(%rip) .L170: rep ret
  • 57. Compiler Reordering Avoidance ● volatile enforces compiler to issue memory operation as programmer want (Note that it is not enforced to do DRAM access) ● However, repetitive LOAD may harm performance C code Assembly language code static volatile int the_var; void loop(void) { int i; for (i = 0; ACCESS_ONCE(i) < 1000; i++) the_var++; } loop: ... .L174: movl the_var(%rip), %edx ... addl $1, %edx movl %edx, the_var(%rip) ... cmpl $999, %edx jle .L174 .L170: rep ret .cfi_endproc
  • 58. Compiler Reordering Avoidance ● Complete memory barrier can help the case ● Does memory access once and uses register for loop condition check C code Assembly language code static int the_var; void loop(void) { int i; for (i = 0; i < 1000; i++) the_var++; barrier(); } loop: .LFB106: ... .L172: addl $1, the_var(%rip) subl $1, %eax jne .L172 rep ret .cfi_endproc
  • 60. Progress perception ● Code does issue LOAD and STORE, but… ● see_progress() can see no progress because change made by a processor propagates to other processor eventually, not immediately C code Assembly language code static int prgrs; void do_progress(void) { prgrs++; } void see_progress(void) { static int last_prgrs; static int seen; static int nr_seen; seen = prgrs; if (seen > last_prgrs) nr_seen++; last_prgrs = seen; } do_progress: ... addl $1, prgrs(%rip) ret ... see_progress: ... movl prgrs(%rip), %eax ... jle .L193 addl $1, nr_seen.5542(%rip) .L193: movl %eax, last_prgrs.5540(%rip) ret .cfi_endproc
  • 61. Progress perception ● Read barrier and write barrier helps the situation C code Assembly language code static int prgrs; void do_progress(void) { prgrs++; smp_wmb(); } void see_progress(void) { static int last_prgrs; static int seen; static int nr_seen; smp_rmb(); seen = prgrs; if (seen > last_prgrs) nr_seen++; last_prgrs = seen; } do_progress: ... addl $1, prgrs(%rip) ... sfence ret see_progress: ... lfence ... movl prgrs(%rip), %eax ... jle .L193 addl $1, nr_seen.5542(%rip) .L193: movl %eax, last_prgrs.5540(%rip)
  • 63. Neither Loads Nor Stores Are Reordered with Likes CPU 0 CPU 1 STORE 1 X STORE 1 Y R1 = LOAD Y R2 = LOAD X R1 == 1 && R2 == 0 impossible
  • 64. Stores Are Not Reordered With Earlier Loads CPU 0 CPU 1 R1 = LOAD X STORE 1 Y R2 = LOAD Y STORE 1 X R1 == 1 && R2 == 1 impossible
  • 65. Loads May Be Reordered with Earlier Stores to Different Locations CPU 0 CPU 1 STORE 1 X R1 = LOAD Y STORE 1 Y R2 = LOAD X R1 == 0 && R2 == 0 possible
  • 66. Intra-Processor Forwarding Is Allowed CPU 0 CPU 1 STORE 1 X R1 = LOAD X R2 = LOAD Y STORE 1 Y R3 = LOAD Y R4 = LOAD X R2 == 0 && R4 == 0 possible
  • 67. Stores Are Transitively Visible CPU 0 CPU 1 CPU 2 STORE 1 X R1 = LOAD X STORE 1 Y R2 = LOAD Y R3 = LOAD X R1 == 1 && R2 == 1 && R3 == 0 impossible
  • 68. Stores Are Seen in a Consistent Order by Others CPU 0 CPU 1 CPU 2 CPU 3 STORE 1 X STORE 1 Y R1 = LOAD X R2 = LOAD Y R3 = LOAD Y R4 = LOAD X R1 == 0 && R2 == 0 && R3 == 1 && R4 == 0 impossible
  • 69. X86 Memory Ordering Summary ● LOAD after LOAD never reordered ● STORE after STORE never reordered ● STORE after LOAD never reordered ● STOREs are transitively visible ● STOREs are seen in consistent order by others ● Intra-processor STORE forwarding is possible ● LOAD from different location after STORE may be reordered ● In short, quite reasonably strong enough ● For more detail, refer to `Intel Architecture Software Developer’s Manual`
  • 70. Summary ● Nature of Parallel Land is counter-intuitive ○ Cannot define order of events without interaction ○ Ordering rule is different for different environment ○ Memory model defines their ordering rule ○ In short, they’re all mad here ● For human-intuitive and correct program, interaction is necessary ○ Every memory model provides synchronization primitives like atomic instruction and memory barrier, etc ○ Such interaction is expensive in common ● Linux kernel memory model is based on weakest memory model, Alpha ○ Kernel programmers should assume Alpha when writing architecture independent code ○ Because of the expensive cost of synchronization primitives, programmer should use only necessary primitives on necessary location
  • 72. This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 73. These Slides Has Been Presented For ● SW Maestro 100+ Conference 2016 (http://onoffmix.com/event/57240) ● KOSSCON 2016 (https://kosscon.kr/2016)