1/23/2014 12:15 PM
Geunsik Lim
Sungkyunkwan University
Samsung Electronics
Evolution of Memory Architecture
 System software running on a NUMA architecture needs to be aware of the processor
topology in order to properly allocate memory and processes to maximize performance.
UMA (Uniform Memory Architecture) vs. NUMA (Non-Uniform Memory Architecture)
[Figure: under UMA, all CPUs share a single memory over one bus. Under NUMA, each node pairs a group of cores (C0-C3) with its own local memory, and the nodes communicate over an interconnection network. The operating system and run-time system manage the resulting hierarchy of nodes, sockets, cores, and threads.]
2
Must end users be NUMA-aware?
 Users must be aware of PCIe device slot placement
 Optimal NUMA tuning is not yet performed by the OS
 Persistent tuning is a non-trivial task
 Performance challenges are changing faster than tools
Unfortunately, yes.
3
Motivation
[Figures: the IBM Cell Broadband Engine, the ccNUMA architecture of the SGI Altix 3000, and an IBM ccNUMA architecture with 24 Gb/s interconnect links.]
 In practice, server administrators have varying levels of OS knowledge.
 As a result, not every administrator can tune a NUMA server to optimize memory utilization and performance in a real environment.
4
What is the main goal?
 Propose a user-space automatic service daemon that provides the best performance by avoiding unnecessary latency (aimed at newcomers to server administration)
 Bind processes to NUMA nodes automatically in user space
 Automatically improve NUMA system performance with the proposed system
 Also support a manual-setting infrastructure, like the existing system, for veteran administrators
5
Related work
Autonuma
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

NUMA Balancer
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

Mel Gorman's MM
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

Sergey
  Pros: user-space approach; aggressive approach
  Cons: manual configuration; hurts memory utilization because of the affinity method; not a memory scheduler

UNAS
  Pros: automatic; user-space approach; easy to manage; aggressive approach
  Cons: no in-depth memory management; not a memory scheduler
6
What is UNAS?
 UNAS is a user-space scheduler that monitors NUMA topology and usage.
 UNAS distributes load for good locality, providing the best performance by avoiding unnecessary latency.
 The goal of UNAS is to automatically bind processes to NUMA nodes; it is released under the GPL license.
[Figure: initial allocation vs. new NUMA-aware allocation]
7
Design
[Diagram: in user space, a runtime monitor (Monitor and Reporter components) collects NUMA-specific data about tasks from procfs and sysfs and maintains a NUMA list, which the user-space scheduler consumes; the kernel side contributes the NUMA memory nodes themselves.]
8
Proposed Scheduler
Algorithm 1. Monitor: runtime monitoring mechanism
1. Create a new thread to receive and handle the run-time monitoring data
2. Repeat monitoring until the NUMA-aware user-space scheduler stops
3.   Sleep, then sample NUMA-specific data (from /proc/stat)
4.   Collect the monitoring report
5. End repeat loop
Algorithm 2. Reporter: reporting mechanism for the collected NUMA-specific data
Input: run-time monitoring data
1. Repeat until the runtime monitoring mechanism stops
2.   Receive data from online monitoring and filter it
3.   Collect NUMA-specific data
4.   If the system load is unbalanced, the behavior of the processes has changed, or a powerful core is idle
5.     Compute the run-time speedup factor
6.     Sort the process NUMA list by multi-core speedup factor
7.     Compute the contention degradation factor
8.     Sort the process NUMA list by contention degradation factor
9.     Send a signal to trigger scheduling
10.  End if
11. End repeat loop
9
Proposed Scheduler
Algorithm 3. User-space Scheduler: automatic NUMA-aware scheduling
Input: NUMA list
1. Compute the number of powerful-core candidates based on the load-balanced memory policy
2. Retrieve suitable processes to be scheduled on powerful cores from the NUMA list
3. Set static CPU pins from the administrator's manual input
4. If the retrieved processes != the current processes on powerful cores
5.   Migrate the processes
6. End if
7. If the current resource contention degradation is too high
8.   Scatter the processes with heavy contention
9.   Calculate the degradation factor to minimize resource contention degradation
10.  Migrate the processes and their sticky pages
11. End if
10
Flowchart of Proposed Scheduler
[Flowchart: START -> new allocation: monitor the characteristics of NUMA (from /proc/<pid>/stat, /proc/<pid>/numa_maps, and /sys/class/numa_topology) -> allocate memory based on the monitoring info -> re-allocation: reallocate dynamically for optimal placement, every 10 seconds -> END. The administrator can also set a static CPU pin manually at any point.]
11
Implementation of UNAS
Default values:
  Max Nodes: 256
  Max CPUs: 2,048
  CPU Threshold: 30
  CPU Scale Factor: 100
  Memory Threshold: 300 MB

Core loop of the implementation:
for ( ; ; ) {
    if (NUMA) {
        update_processes();        /* reads "/proc/%s/stat" */
        interval = manage_loads(); /* bind_process_and_migrate_memory() */
        time_interval(10);
    }
}

Sample /proc/<pid>/stat (field 18 is the priority; field 39 is the CPU the task last ran on, 0~79):
invain@numa-server:/proc/2028$ cat ./stat
2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0 2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423 18446744073709551615 0 0 17 53 0 0 30 0 0

Sample /proc/<pid>/numa_maps (the N<node>=<pages> columns show which nodes hold the heap and stack pages):
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap
7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack
7fffb6601000 default stack anon=37 dirty=37 N1=37
12
Evaluation
Test platform (40 cores + 40 hyper-threads = 80 logical CPUs):
• Server: DELL PowerEdge R910
• CPU: Intel Xeon E7-4850 @ 2.00GHz (40 cores)
• Memory: 32 GiB
• OS: Linux 3.2
• Platform: Ubuntu 12.04 LTS
• Benchmarks: PARSEC
13
Evaluation
[Slides 14-16: PARSEC benchmark result charts, not captured in this transcript.]
References
 AutoNUMA v26: http://lwn.net/Articles/488709/
 Peter Zijlstra's NUMA scheduling patch set: http://lwn.net/Articles/486858/
 NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2)
 libnuma: link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly; the higher-level interface provided by the numa(3) functions in the numactl package is recommended.
 RHEL 6.3: Red Hat Enterprise Linux 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
 Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
 Yinan Li et al., "NUMA-aware Algorithms: the Case of Data Shuffling," CIDR 2013
17
Conclusion
 Not all administrators can easily manage a NUMA server to optimize memory utilization and performance in a real environment.
 UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
 UNAS distributes load for good locality to provide the best performance.
 UNAS automatically binds processes to NUMA nodes and is released under a beerware license.
18
Thank you for your attention
Any questions?
19
BACKUP SLIDES
In Case We Have More Time…
20
Migrating pages to optimize NUMA locality
NUMASCHED
  Who: Lee Schermerhorn (HP)
  Progress: RFC (at LPC2010)
  Key factors: lazy/auto-migration
  Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
  Operations: automatic page migration for virtualization on x86_64
  Eval: refer to NUMA Balancer 1)

SchedNUMA
  Who: Peter Zijlstra (Red Hat)
  Progress: PATCH v1 (a rewrite of NUMASCHED)
  Key factors: allowing processes to be put into "NUMA groups" that share the same home node
  Details: 1) putting processes into NUMA groups: int numa_mbind(); 2) binding to the NUMA group identified by ng_id: int numa_tbind()
  Operations: new system calls; http://lwn.net/Articles/486850/
  Eval: 55% faster than mainline (Dan Smith)

AutoNUMA
  Who: Andrea Arcangeli (Red Hat)
  Progress: alpha 23
  Key factors: scanning / auto-migration
  Details: pagetable scanner; knuma_migrated per-NUMA-node queues
  Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  Eval: 35% faster than mainline (Dan Smith)

Automatic NUMA balancing
  Who: Mel Gorman (SUSE)
  Progress: since Linux 3.8
  Key factors: SchedNUMA + AutoNUMA
  Details: migration on fault; migration with the PTE (Migrate On Reference Of pte_numa Node [MORON])
  Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git (mm-balancenuma-v4r38)
  Eval: mmtest utility

• GOAL: keep processes and their memory together on the same NUMA node, with support for automatically migrating pages to optimize NUMA locality.
• Eval 1): https://lkml.org/lkml/2012/3/20/508
• AutoNUMA benchmark v0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
• mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
21
Tools for NUMA Tuning
 numactl
 cgroups
 taskset
 lstopo
 dmidecode
 sysfs
 irqbalance
 numad
 top
 numatop
 htop
 tuna
 irqstat
 tuned-adm
Removal of existing bottlenecks
 Multi-queue block layer: http://kernel.dk/blk-mq.pdf
Improved tools
 numatop: https://01.org/numatop
 top: https://gitorious.org/procps/procps (top: added NUMA support)
 irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
 Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
 NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
22
Appendix. /proc/[number]/numa_maps (since Linux 2.6.14)
 This file displays information about a process's NUMA memory policy and allocation.
 Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and the nodes on which its pages have been allocated.
 numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
23
Appendix. /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
• http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
• http://man7.org/linux/man-pages/man7/numa.7.html
24
Appendix “numactl” Sample
 numactl --physcpubind=+0-4,8-12 myapplic arguments : run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
 numactl --interleave=all bigdatabase arguments : run big database with its memory interleaved across all nodes.
 numactl --cpubind=0 --membind=0,1 process : run process on node 0 with memory allocated on nodes 0 and 1.
 numactl --cpubind=0 --membind=0,1 -- process -l : run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
 numactl --preferred=1 numactl --show : set preferred node 1 and show the resulting state.
 numactl --interleave=all --shmkeyfile /tmp/shmkey : interleave the SysV shared memory region specified by /tmp/shmkey over all nodes.
 numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch : bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
 numactl --localalloc /dev/shm/file : reset the policy for the shared memory file file to the default localalloc policy.
25

More Related Content

What's hot

linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
iman darabi
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
Welly Dian Astika
 
Refining Linux
Refining LinuxRefining Linux
Refining Linux
Jason Murray
 
BKK16-104 sched-freq
BKK16-104 sched-freqBKK16-104 sched-freq
BKK16-104 sched-freq
Linaro
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systems
Joshua Mora
 
HKG15-100: What is Linaro working on - core development lightning talks
HKG15-100:  What is Linaro working on - core development lightning talksHKG15-100:  What is Linaro working on - core development lightning talks
HKG15-100: What is Linaro working on - core development lightning talks
Linaro
 
Linux Troubleshooting
Linux TroubleshootingLinux Troubleshooting
Linux Troubleshooting
Keith Wright
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
Brendan Gregg
 
Linux System Monitoring
Linux System Monitoring Linux System Monitoring
Linux System Monitoring
PriyaTeli
 
Linux System Monitoring basic commands
Linux System Monitoring basic commandsLinux System Monitoring basic commands
Linux System Monitoring basic commands
Mohammad Rafiee
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
Babak Farrokhi
 
System performance monitoring pcp + vector
System performance monitoring   pcp + vectorSystem performance monitoring   pcp + vector
System performance monitoring pcp + vector
Sandeep Kunkunuru
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
Mydbops
 
Measuring directly from cpu hardware performance counters
Measuring directly from cpu  hardware performance countersMeasuring directly from cpu  hardware performance counters
Measuring directly from cpu hardware performance counters
Jean-Philippe BEMPEL
 
Operating Systems - Processor Management
Operating Systems - Processor ManagementOperating Systems - Processor Management
Operating Systems - Processor Management
Damian T. Gordon
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
Hao-Ran Liu
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
Brendan Gregg
 

What's hot (20)

linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
 
Refining Linux
Refining LinuxRefining Linux
Refining Linux
 
BKK16-104 sched-freq
BKK16-104 sched-freqBKK16-104 sched-freq
BKK16-104 sched-freq
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systems
 
Linux monitoring
Linux monitoringLinux monitoring
Linux monitoring
 
HKG15-100: What is Linaro working on - core development lightning talks
HKG15-100:  What is Linaro working on - core development lightning talksHKG15-100:  What is Linaro working on - core development lightning talks
HKG15-100: What is Linaro working on - core development lightning talks
 
Linux Troubleshooting
Linux TroubleshootingLinux Troubleshooting
Linux Troubleshooting
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
Linux System Monitoring
Linux System Monitoring Linux System Monitoring
Linux System Monitoring
 
Linux System Monitoring basic commands
Linux System Monitoring basic commandsLinux System Monitoring basic commands
Linux System Monitoring basic commands
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
System performance monitoring pcp + vector
System performance monitoring   pcp + vectorSystem performance monitoring   pcp + vector
System performance monitoring pcp + vector
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
 
Measuring directly from cpu hardware performance counters
Measuring directly from cpu  hardware performance countersMeasuring directly from cpu  hardware performance counters
Measuring directly from cpu hardware performance counters
 
Operating Systems - Processor Management
Operating Systems - Processor ManagementOperating Systems - Processor Management
Operating Systems - Processor Management
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 

Similar to UNAS-20140123-1800

AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
Amazon Web Services
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
Preeti Chauhan
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
Amazon Web Services
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
Aman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptxAman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptx
vikramkagitapu
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
Amazon Web Services
 
unit_1.pdf
unit_1.pdfunit_1.pdf
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
Amazon Web Services
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
Brendan Gregg
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Amazon Web Services
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
Pavithra S
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
Karunamoorthy B
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
Rajesh Gupta
 
JP Morgan Remote to Core Implementation
JP Morgan Remote to Core ImplementationJP Morgan Remote to Core Implementation
JP Morgan Remote to Core Implementation
John Napier
 

Similar to UNAS-20140123-1800 (20)

AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Aman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptxAman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptx
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
 
unit_1.pdf
unit_1.pdfunit_1.pdf
unit_1.pdf
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
fall2013
fall2013fall2013
fall2013
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
JP Morgan Remote to Core Implementation
JP Morgan Remote to Core ImplementationJP Morgan Remote to Core Implementation
JP Morgan Remote to Core Implementation
 
Aca 2
Aca 2Aca 2
Aca 2
 

More from Samsung Electronics

Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung Electronics
 
Distributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen PackageDistributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen Package
Samsung Electronics
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940Samsung Electronics
 
kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340Samsung Electronics
 
Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Samsung Electronics
 
booting-booster-final-20160420-0700
booting-booster-final-20160420-0700booting-booster-final-20160420-0700
booting-booster-final-20160420-0700Samsung Electronics
 

More from Samsung Electronics (8)

Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)
 
Distributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen PackageDistributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen Package
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940
 
kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340
 
gcce-uapm-slide-20131001-1900
gcce-uapm-slide-20131001-1900gcce-uapm-slide-20131001-1900
gcce-uapm-slide-20131001-1900
 
distcom-short-20140112-1600
distcom-short-20140112-1600distcom-short-20140112-1600
distcom-short-20140112-1600
 
Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900
 
booting-booster-final-20160420-0700
booting-booster-final-20160420-0700booting-booster-final-20160420-0700
booting-booster-final-20160420-0700
 

UNAS-20140123-1800

  • 1. 1/23/2014 12:15 PM Geunsik Lim Sungkyunkwan University Samsung Electronics
  • 2. Evolution of Memory Architecture  System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance. UMA (Uniform Memory Architecture) NUMA (Non-Uniform Memory Architecture) CPU CPU CPU CPU Memory Memory C1C0 C2 C3 Memory C1C0 C2 C3 Memory C1C0 C2 C3 Memory C1C0 C2 C3 Interconnection Network Nodes, Sockets, Cores, Threads Operating System Run-time System 2
  • 3. Must end users be NUMA-aware?  Users must be aware of PCIe device slot placement  Optimal NUMA tuning is not yet performed by the OS  Persistent tuning is a non-trivial task  Performance challenges are changing faster than tools Unfortunately, yes. 3
  • 4. Motivation IBM Cell Broadband Engine ccNUMA architecture (SGI Altix 3000) ccNUMA architecture (IBM) 24 Gb/s24 Gb/s  Actually, server administrators have different OS knowledge.  Therefore, all administrators can not manage the NUMA server for the optimizing memory utilization and performance in real environment. 4
  • 5. What is the goal mainly?  Propose the user-space automatic service daemon to provide the best performance, by avoiding unnecessary latency. (For newbies at the server administration)  Binding Processes to NUMA Nodes Automatically in user-space  Automatically Improve NUMA System Performance with the Proposed system  Also, support the manual setting infrastructure like the existing system for the veteran. 5
  • 6. Related work Approach Pros. Cons. Autonuma • Kernel-space • Purely OS approach • Not aggressive approach Numa Balancer • Kernel-space • Purely OS approach • Not aggressive approach Melgorman’s MM • Kernel-space • Purely OS approach • Not aggressive approach Sergey • User-space approach • Aggressive approach • Manual configuration • Damage of memory utilization because of Affinity method • It’s no memory scheduler UNAS • Automatic • User-space approach • Easy to manage • Aggressive approach • Don’t follow-up in-depth Memory management • It’s no memory scheduler 6
  • 7. What is UNAS?  UNAS is a user-space scheduler that monitors NUMA topology and usage  UNAS distributes loads for good locality for the purpose of providing the best performance, by avoiding unnecessary latency.  Goal of UNAS is to automatically bind processes to NUMA nodes as GPL license. Initial allocation New NUMA-aware allocation 7
  • 8. Design User-space Scheduler NUMA List . . . . Monitor Reporter Collect NUMA Specific Data Task ProcFS&SysFS User-Space Scheduler User-Space Runtime Monitor User-space Kernel- space NUMA Memory Node 8
  • 9. Proposed Scheduler Algorithm 1. Monitor: Runtime monitoring mechanism 1. Create a new thread for receiving and dealing with the run-time monitoring data 2. Repeat monitoring until NUMA-aware user-space scheduler stop 3. Sleep for an NUMA specific data (from /proc/stat) 4. Collect the monitoring report 5. End Repeat loop Algorithm 2. Reporter: Collected NUMA specified data reporting mechanism Input: run-time monitoring data 1. Repeat until runtime monitoring mechanism stop 2. Receiving data and filtering them from online monitoring 3. Collect NUMA specific data 4. If loading of system is unbalanced or behavior of the processes changed or powerful core is idle 5. Computing the Run-time speedup factor 6. Sorting the process NUMA list by multi-core speedup factor 7. Computing the contention degradation factor 8. Sorting the process NUMA list by contention degradation factor 9. Sending signal to trigger schedule 10. End if 11. End Repeat loop 9
  • 10. Proposed Scheduler Algorithm 3. User-space Scheduler: Automatic NUMA aware scheduling Input: NUMA list 1. Computing the number of powerful core candidate based on load balanced memory policy 2. Retrieving suitable processes to be scheduled on powerful cores from NUMA list 3. Setting static CPU pin from manual input of administrator 4. If retrieved processes != current processes on powerful cores 5. Migrate the processes 6. End if 7. If current resource contention degradation is too big 8. Scatter the processes with heavy contention 9. Calculating degradation factor in order to minimize resource contention degradation 10. Migrate the processes and the its sticky pages 11. End if 10
  • 11. Flowchart of Proposed Scheduler Monitoring the characteristics of NUMA Setting static CPU pin manually Allocate Memory based on monitoring info. Reallocate for optimal allocation dynamically Per 10 Seconds Manual Setting by Administrator END START New allocation Re-allocation • /proc/<pid>/stat • /proc/<pid>/numa_maps • /sys/class/numa_topology 11
  • 12. Implementation of UNAS Content Default Value Max Nodes 256 Max CPUs 2,048 C P U CPU Threshold 30 CPU Scale Factor 100 Memory Threshold 300 MB Implementation for ( ; ; ) { if (NUMA) { update_processes();  "/proc/%s/stat" interval = manage_loads();  bind_process_and_migrate_memory() time_interval(10); } } invain@numa-server:/proc/2$> cat ./stat 2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0 2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423 18446744073709551615 0 0 17 53 0 0 30 0 0 Priority invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap 7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2 invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack 7fffb6601000 default stack anon=37 dirty=37 N1=37 CPU (0~79) #Node of Heap #Node of Stack 12
• 13. Evaluation
  - Server: Dell PowerEdge R910 (40 cores + 40 hyper-threads)
  - CPU: Intel Xeon E7-4850 @ 2.00 GHz (40 cores)
  - Memory: 32 GiB
  - OS: Linux 3.2
  - Platform: Ubuntu 12.04 LTS
  - Benchmarks: PARSEC
• 17. References
   AutoNUMA v26: http://lwn.net/Articles/488709/
   Peter Zijlstra's NUMA scheduling patch set: http://lwn.net/Articles/486858/
   NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2)
   libnuma: link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly; the higher-level interface provided by the numa(3) functions in the numactl package is recommended.
   RHEL 6.3: Red Hat Enterprise Linux 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
   Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
   Yinan Li et al., "NUMA-Aware Algorithms: the Case of Data Shuffling," CIDR 2013
• 18. Conclusion
   Not all administrators can easily manage a NUMA server to optimize memory utilization and performance in real environments.
   UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
   UNAS distributes load for good locality, with the goal of providing the best performance.
   UNAS automatically binds processes to NUMA nodes, and is released under a Beerware-style license.
• 19. Thank you for your attention. Any questions?
• 20. Backup Slides: In Case We Have More Time…
• 21. Migrating pages to optimize NUMA locality
  GOAL: keep processes and their memory together on the same NUMA node; support automatically migrating pages to optimize NUMA locality.

  NUMASCHED
  - Who: Lee Schermerhorn (HP)
  - Progress: RFC (at LPC 2010)
  - Key factors: lazy/auto-migration
  - Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
  - Operations: automatic page migration for virtualization on x86_64
  - Eval: 1) refer to NUMA Balancer

  SchedNUMA
  - Who: Peter Zijlstra (Red Hat)
  - Progress: PATCH v1 (rewrite of NUMASCHED)
  - Key factors: allowing processes to be put into "NUMA groups" that will share the same home node
  - Details: 1) put processes into NUMA groups: int numa_mbind(); 2) bind to the NUMA group identified by ng_id: int numa_tbind()
  - Operations: new system calls, http://lwn.net/Articles/486850/
  - Eval 1): 55% faster than mainline (Dan Smith)

  AutoNUMA
  - Who: Andrea Arcangeli (Red Hat)
  - Progress: alpha 23
  - Key factors: scanning / auto-migration
  - Details: page-table scanner / per-NUMA-node knuma_migrated queues; migration on fault
  - Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  - Eval: 35% faster than mainline (Dan Smith)

  Automatic NUMA balancing
  - Who: Mel Gorman (SUSE)
  - Progress: since Linux 3.8
  - Key factors: SchedNUMA + AutoNUMA
  - Details: migration with PTE (Migrate On Reference Of pte_numa Node [MORON])
  - Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git, mm-balancenuma-v4r38
  - Eval: mmtest utility

  Notes:
  - Eval 1): https://lkml.org/lkml/2012/3/20/508
  - AutoNUMA benchmark ver 0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
  - mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
• 22. Tools for NUMA Tuning
   numactl, cgroups, taskset, lstopo, dmidecode, sysfs, irqbalance, numad, top, numatop, htop, tuna, irqstat, tuned-adm

  Removal of existing bottlenecks
  - Multi-queue block layer: http://kernel.dk/blk-mq.pdf

  Improved tools
  - numatop: https://01.org/numatop
  - top: https://gitorious.org/procps/procps (top: added NUMA support)
  - irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
  - Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linuxperformance-checklist/
  - NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
• 23. Appendix: /proc/[number]/numa_maps (since Linux 2.6.14)
   This file displays information about a process's NUMA memory policy and allocation.
   Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and on which nodes the pages have been allocated.
   numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
• 24. Appendix: /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
  - http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
  - http://man7.org/linux/man-pages/man7/numa.7.html
• 25. Appendix: "numactl" Samples
   numactl --physcpubind=+0-4,8-12 myapplic arguments
    Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
   numactl --interleave=all bigdatabase arguments
    Run big database with its memory interleaved on all CPUs.
   numactl --cpubind=0 --membind=0,1 process
    Run process on node 0 with memory allocated on nodes 0 and 1.
   numactl --cpubind=0 --membind=0,1 -- process -l
    Run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
   numactl --preferred=1 numactl --show
    Set preferred node 1 and show the resulting state.
   numactl --interleave=all --shmkeyfile /tmp/shmkey
    Interleave all of the SysV shared memory region specified by /tmp/shmkey over all nodes.
   numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
    Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
   numactl --localalloc /dev/shm/file
    Reset the policy for the shared memory file file to the default localalloc policy.
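The CPU-pinning half of these numactl examples (--physcpubind) can also be done from code: on Linux, Python's standard-library os.sched_setaffinity wraps the same affinity interface that numactl and taskset use. This is only a sketch of the pinning side; memory policy (--membind, --interleave) still requires libnuma or numactl. The helper name is our own:

```python
# Pin the calling process to a CPU set, analogous to
# "numactl --physcpubind=...". Linux-only (os.sched_setaffinity).
import os

def pin_to_cpus(cpus):
    """Restrict this process to the given CPUs; return the resulting mask."""
    allowed = os.sched_getaffinity(0) & set(cpus)
    if allowed:                       # never pin to CPUs we cannot run on
        os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)

original = os.sched_getaffinity(0)    # remember the original mask
print(pin_to_cpus({0}))               # CPU 0 exists on any Linux machine
os.sched_setaffinity(0, original)     # restore the original mask
```

numactl remains the right tool interactively; an affinity call like this is what a daemon such as UNAS would use when it decides to migrate a process.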