This document discusses NUMA (Non-Uniform Memory Access) architectures and strategies for optimizing performance on NUMA systems. It proposes a user-space scheduler called UNAS that would automatically bind processes to NUMA nodes to improve locality and avoid unnecessary latency. UNAS would monitor NUMA topology and usage, distribute workloads across nodes, and migrate processes and memory as needed to maintain balance and minimize contention. The goal is to provide good performance out of the box without requiring manual tuning.
2. Evolution of Memory Architecture
System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance.

[Figure: UMA (Uniform Memory Architecture) vs. NUMA (Non-Uniform Memory Architecture). Under UMA, all CPUs share a single memory. Under NUMA, each node (cores C0-C3) has its own local memory, and nodes communicate over an interconnection network. The hierarchy spans nodes, sockets, cores, and threads, managed by the operating system and the run-time system.]
3. Must end users be NUMA-aware?
• Users must be aware of PCIe device slot placement.
• Optimal NUMA tuning is not yet performed by the OS.
• Persistent tuning is a non-trivial task.
• Performance challenges are changing faster than the tools.
Unfortunately, yes.
4. Motivation
[Figure: ccNUMA examples - the IBM Cell Broadband Engine, the SGI Altix 3000 ccNUMA architecture, and an IBM ccNUMA system with 24 Gb/s interconnect links.]
Server administrators have varying levels of OS knowledge, so not all of them can manage a NUMA server for optimal memory utilization and performance in a real environment.
5. What is the goal mainly?
Propose a user-space automatic service daemon that provides the best performance by avoiding unnecessary latency (aimed at newcomers to server administration).
• Bind processes to NUMA nodes automatically, in user space.
• Automatically improve NUMA system performance with the proposed system.
• Also support a manual-setting infrastructure, like the existing tools, for veteran administrators.
6. Related work
Approach: AutoNUMA
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach
Approach: NUMA Balancer
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach
Approach: Mel Gorman's MM
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach
Approach: Sergey (Blagodurov et al.)
  Pros: user-space approach; aggressive approach
  Cons: manual configuration; damages memory utilization because of its affinity method; not a memory scheduler
Approach: UNAS
  Pros: automatic; user-space approach; easy to manage; aggressive approach
  Cons: does not follow up with in-depth memory management; not a memory scheduler
7. What is UNAS?
UNAS is a user-space scheduler that monitors NUMA topology and usage.
UNAS distributes loads for good locality, providing the best performance by avoiding unnecessary latency.
The goal of UNAS is to automatically bind processes to NUMA nodes; it is released under the GPL.
[Figure: initial allocation vs. the new NUMA-aware allocation.]
8. Design
[Figure: UNAS design. In user space, a runtime monitor (Monitor and Reporter components) collects NUMA-specific data from ProcFS and SysFS and maintains a NUMA list of tasks; the user-space scheduler consumes this list and places tasks on the NUMA memory nodes. Only the data sources (ProcFS & SysFS, NUMA memory nodes) live in kernel space.]
9. Proposed Scheduler
Algorithm 1. Monitor: run-time monitoring mechanism
1. Create a new thread for receiving and handling the run-time monitoring data.
2. Repeat until the NUMA-aware user-space scheduler stops:
3.   Sleep, then read NUMA-specific data (from /proc/stat).
4.   Collect the monitoring report.
5. End repeat.
Algorithm 2. Reporter: reporting mechanism for the collected NUMA-specific data
Input: run-time monitoring data
1. Repeat until the run-time monitoring mechanism stops:
2.   Receive data from online monitoring and filter it.
3.   Collect NUMA-specific data.
4.   If the system load is unbalanced, the behavior of the processes has changed, or a powerful core is idle:
5.     Compute the run-time speedup factor.
6.     Sort the process NUMA list by the multi-core speedup factor.
7.     Compute the contention degradation factor.
8.     Sort the process NUMA list by the contention degradation factor.
9.     Send a signal to trigger scheduling.
10.  End if.
11. End repeat.
10. Proposed Scheduler
Algorithm 3. User-space scheduler: automatic NUMA-aware scheduling
Input: NUMA list
1. Compute the number of powerful-core candidates based on the load-balanced memory policy.
2. Retrieve from the NUMA list the processes suited to be scheduled on powerful cores.
3. Apply any static CPU pins given as manual input by the administrator.
4. If the retrieved processes != the current processes on the powerful cores:
5.   Migrate the processes.
6. End if.
7. If the current resource-contention degradation is too large:
8.   Scatter the processes with heavy contention.
9.   Calculate the degradation factor so as to minimize resource-contention degradation.
10.  Migrate the processes and their sticky pages.
11. End if.
11. Flowchart of Proposed Scheduler
[Flowchart: START → new allocation: monitor the characteristics of the NUMA system (with optional static CPU pinning set manually by the administrator), then allocate memory based on the monitoring information. Re-allocation: every 10 seconds, reallocate dynamically for an optimal allocation. → END]
Data sources:
• /proc/<pid>/stat
• /proc/<pid>/numa_maps
• /sys/class/numa_topology
12. Implementation of UNAS
Content               Default value
Max nodes             256
Max CPUs              2,048
CPU threshold         30
CPU scale factor      100
Memory threshold      300 MB
Implementation (main loop of the UNAS daemon):

    for (;;) {
        if (NUMA) {
            update_processes();         /* reads "/proc/%s/stat"                      */
            interval = manage_loads();  /* may call bind_process_and_migrate_memory() */
            time_interval(10);          /* repeat every 10 seconds                    */
        }
    }

invain@numa-server:/proc/2028$ sudo cat ./stat
2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0 2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423 18446744073709551615 0 0 17 53 0 0 30 0 0
(The 18th field, 20, is the process priority; this machine's CPUs are numbered 0-79.)

invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap
7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2
(The Nx=count fields give the number of heap pages on each node.)

invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack
7fffb6601000 default stack anon=37 dirty=37 N1=37
(All 37 stack pages reside on node 1.)
17. References
• Auto NUMA v26: http://lwn.net/Articles/488709/
• Peter Zijlstra's NUMA scheduling patch set: http://lwn.net/Articles/486858/
• NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2)
• libnuma: link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly; the higher-level interface provided by the numa(3) functions in the numactl package is recommended.
• RHEL 6.3: Red Hat Enterprise Linux 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
• Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
• Yinan Li et al., "NUMA-Aware Algorithms: the Case of Data Shuffling," CIDR 2013
18. Conclusion
Not all administrators can easily manage a NUMA server for optimal memory utilization and performance in a real environment.
UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
UNAS distributes loads for good locality, with the aim of providing the best performance.
UNAS automatically binds processes to NUMA nodes and is released under the Beerware license.
21. Migrating pages to optimize NUMA locality
NUMASCHED
  Who: Lee Schermerhorn (HP)
  Progress: RFC (at LPC 2010)
  Key factors: lazy/auto-migration
  Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
  Operations: automatic page migration for virtualization on x86_64
  Eval: 1) refer to NUMA Balancer

SchedNUMA
  Who: Peter Zijlstra (Red Hat)
  Progress: PATCH v1 (rewrite of NUMASCHED)
  Key factors: allowing processes to be put into "NUMA groups" that will share the same home node
  Details: 1) int numa_mbind() puts processes into a NUMA group; 2) int numa_tbind() binds to the NUMA group identified by ng_id
  Operations: new system calls; http://lwn.net/Articles/486850/
  Eval: 55% faster than mainline (Dan Smith)

AutoNUMA
  Who: Andrea Arcangeli (Red Hat)
  Progress: alpha 23
  Key factors: scanning / auto-migration
  Details: page-table scanner with per-NUMA-node knuma_migrated queues; migration on fault
  Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  Eval: 35% faster than mainline (Dan Smith)

Automatic NUMA balancing
  Who: Mel Gorman (SUSE)
  Progress: since Linux 3.8
  Key factors: SchedNUMA + AutoNUMA
  Details: migration with pte_numa PTEs (Migrate On Reference Of pte_numa Node [MORON])
  Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git, mm-balancenuma-v4r38
  Eval: mmtests utility
• GOAL: keep processes and their memory together on the same NUMA node, with support for automatically migrating pages to optimize NUMA locality.
• Eval 1): https://lkml.org/lkml/2012/3/20/508
• AutoNUMA benchmark ver 0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
• mmtests by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
23. Appendix: /proc/[number]/numa_maps (since Linux 2.6.14)
This file displays information about a process's NUMA memory policy and allocation.
Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and the nodes on which its pages have been allocated.
numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
25. Appendix “numactl” Sample
numactl --physcpubind=+0-4,8-12 myapplic arguments
  Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
numactl --interleave=all bigdatabase arguments
  Run big database with its memory interleaved on all CPUs.
numactl --cpubind=0 --membind=0,1 process
  Run process on node 0 with memory allocated on nodes 0 and 1.
numactl --cpubind=0 --membind=0,1 -- process -l
  Run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
numactl --preferred=1 numactl --show
  Set the preferred node to 1 and show the resulting state.
numactl --interleave=all --shmkeyfile /tmp/shmkey
  Interleave all of the SysV shared memory region specified by /tmp/shmkey over all nodes.
numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
numactl --localalloc /dev/shm/file
  Reset the policy for the shared memory file "file" to the default localalloc policy.