1/23/2014 12:15 PM
Geunsik Lim
Sungkyunkwan University
Samsung Electronics
Evolution of Memory Architecture
 System software running on a NUMA architecture needs to be aware of the processor
topology in order to properly allocate memory and processes to maximize performance.
UMA (Uniform Memory Architecture) vs. NUMA (Non-Uniform Memory Architecture)
[Figure: under UMA, all CPUs share a single memory over one bus. Under NUMA, each node pairs a group of cores (C0-C3) with its own local memory, and the nodes communicate over an interconnection network. The operating system and run-time system manage the resulting hierarchy of nodes, sockets, cores, and threads.]
2
Must end users be NUMA-aware?
 Users must be aware of PCIe device slot placement
 Optimal NUMA tuning is not yet performed by the OS
 Persistent tuning is a non-trivial task
 Performance challenges are changing faster than tools
Unfortunately, yes.
3
Motivation
[Figures: the IBM Cell Broadband Engine, the ccNUMA architecture of the SGI Altix 3000, and an IBM ccNUMA architecture with 24 Gb/s interconnect links.]
 In practice, server administrators have varying levels of OS knowledge.
 As a result, not every administrator can tune a NUMA server to optimize memory utilization and performance in a real environment.
4
What is the main goal?
 Propose a user-space automatic service daemon that provides the best performance by avoiding unnecessary latency (aimed at newcomers to server administration)
 Bind processes to NUMA nodes automatically in user space
 Automatically improve NUMA system performance with the proposed system
 Also support a manual-setting infrastructure, like the existing system, for veteran administrators
5
Related work
Autonuma
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

NUMA Balancer
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

Mel Gorman's MM
  Pros: kernel-space; purely OS approach
  Cons: not an aggressive approach

Sergey
  Pros: user-space approach; aggressive approach
  Cons: manual configuration; hurts memory utilization because of the affinity method; not a memory scheduler

UNAS
  Pros: automatic; user-space approach; easy to manage; aggressive approach
  Cons: no in-depth memory management; not a memory scheduler
6
What is UNAS?
 UNAS is a user-space scheduler that monitors NUMA topology and usage.
 UNAS distributes load for good locality, providing the best performance by avoiding unnecessary latency.
 The goal of UNAS is to automatically bind processes to NUMA nodes; it is released under the GPL license.
[Figure: initial allocation vs. new NUMA-aware allocation]
7
Design
[Diagram: in user space, a runtime monitor (Monitor and Reporter components) collects NUMA-specific data about tasks from procfs and sysfs and maintains a NUMA list, which the user-space scheduler consumes; the kernel side contributes the NUMA memory nodes themselves.]
8
Proposed Scheduler
Algorithm 1. Monitor: runtime monitoring mechanism
1. Create a new thread to receive and handle the run-time monitoring data
2. Repeat monitoring until the NUMA-aware user-space scheduler stops
3.   Sleep, then sample NUMA-specific data (from /proc/stat)
4.   Collect the monitoring report
5. End repeat loop
Algorithm 2. Reporter: reporting mechanism for the collected NUMA-specific data
Input: run-time monitoring data
1. Repeat until the runtime monitoring mechanism stops
2.   Receive data from online monitoring and filter it
3.   Collect NUMA-specific data
4.   If the system load is unbalanced, the behavior of the processes has changed, or a powerful core is idle
5.     Compute the run-time speedup factor
6.     Sort the process NUMA list by multi-core speedup factor
7.     Compute the contention degradation factor
8.     Sort the process NUMA list by contention degradation factor
9.     Send a signal to trigger scheduling
10.  End if
11. End repeat loop
9
Proposed Scheduler
Algorithm 3. User-space Scheduler: automatic NUMA-aware scheduling
Input: NUMA list
1. Compute the number of powerful-core candidates based on the load-balanced memory policy
2. Retrieve suitable processes to be scheduled on powerful cores from the NUMA list
3. Set static CPU pins from the administrator's manual input
4. If the retrieved processes != the current processes on powerful cores
5.   Migrate the processes
6. End if
7. If the current resource contention degradation is too high
8.   Scatter the processes with heavy contention
9.   Calculate the degradation factor to minimize resource contention degradation
10.  Migrate the processes and their sticky pages
11. End if
10
Flowchart of Proposed Scheduler
[Flowchart: START -> new allocation: monitor the characteristics of NUMA (from /proc/<pid>/stat, /proc/<pid>/numa_maps, and /sys/class/numa_topology) -> allocate memory based on the monitoring info -> re-allocation: reallocate dynamically for optimal placement, every 10 seconds -> END. The administrator can also set a static CPU pin manually at any point.]
11
Implementation of UNAS
Default values:
  Max Nodes: 256
  Max CPUs: 2,048
  CPU Threshold: 30
  CPU Scale Factor: 100
  Memory Threshold: 300 MB

Core loop of the implementation:
for ( ; ; ) {
    if (NUMA) {
        update_processes();        /* reads "/proc/%s/stat" */
        interval = manage_loads(); /* bind_process_and_migrate_memory() */
        time_interval(10);
    }
}

Sample /proc/<pid>/stat (field 18 is the priority; field 39 is the CPU the task last ran on, 0~79):
invain@numa-server:/proc/2028$ cat ./stat
2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0 2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423 18446744073709551615 0 0 17 53 0 0 30 0 0

Sample /proc/<pid>/numa_maps (the N<node>=<pages> columns show which nodes hold the heap and stack pages):
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap
7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2
invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack
7fffb6601000 default stack anon=37 dirty=37 N1=37
12
Evaluation
Test platform (40 cores + 40 hyper-threads = 80 logical CPUs):
• Server: DELL PowerEdge R910
• CPU: Intel Xeon E7-4850 @ 2.00GHz (40 cores)
• Memory: 32 GiB
• OS: Linux 3.2
• Platform: Ubuntu 12.04 LTS
• Benchmarks: PARSEC
13
Evaluation
[Slides 14-16: PARSEC benchmark result charts, not captured in this transcript.]
References
 AutoNUMA v26: http://lwn.net/Articles/488709/
 Peter Zijlstra's NUMA scheduling patch set: http://lwn.net/Articles/486858/
 NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2)
 libnuma: link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly; the higher-level interface provided by the numa(3) functions in the numactl package is recommended.
 RHEL 6.3: Red Hat Enterprise Linux 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
 Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
 Yinan Li et al., "NUMA-aware Algorithms: the Case of Data Shuffling," CIDR 2013
17
Conclusion
 Not all administrators can easily manage a NUMA server to optimize memory utilization and performance in a real environment.
 UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
 UNAS distributes load for good locality to provide the best performance.
 UNAS automatically binds processes to NUMA nodes and is released under a beerware license.
18
Thank you for your attention
Any questions?
19
BACKUP SLIDES
In Case We Have More Time…
20
Migrating pages to optimize NUMA locality
NUMASCHED
  Who: Lee Schermerhorn (HP)
  Progress: RFC (at LPC2010)
  Key factors: lazy/auto-migration
  Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
  Operations: automatic page migration for virtualization on x86_64
  Eval: refer to NUMA Balancer 1)

SchedNUMA
  Who: Peter Zijlstra (Red Hat)
  Progress: PATCH v1 (a rewrite of NUMASCHED)
  Key factors: allowing processes to be put into "NUMA groups" that share the same home node
  Details: 1) putting processes into NUMA groups: int numa_mbind(); 2) binding to the NUMA group identified by ng_id: int numa_tbind()
  Operations: new system calls; http://lwn.net/Articles/486850/
  Eval: 55% faster than mainline (Dan Smith)

AutoNUMA
  Who: Andrea Arcangeli (Red Hat)
  Progress: alpha 23
  Key factors: scanning / auto-migration
  Details: pagetable scanner; knuma_migrated per-NUMA-node queues
  Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  Eval: 35% faster than mainline (Dan Smith)

Automatic NUMA balancing
  Who: Mel Gorman (SUSE)
  Progress: since Linux 3.8
  Key factors: SchedNUMA + AutoNUMA
  Details: migration on fault; migration with the PTE (Migrate On Reference Of pte_numa Node [MORON])
  Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git (mm-balancenuma-v4r38)
  Eval: mmtest utility

• GOAL: keep processes and their memory together on the same NUMA node, with support for automatically migrating pages to optimize NUMA locality.
• Eval 1): https://lkml.org/lkml/2012/3/20/508
• AutoNUMA benchmark v0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
• mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
21
Tools for NUMA Tuning
 numactl
 cgroups
 taskset
 lstopo
 dmidecode
 sysfs
 irqbalance
 numad
 top
 numatop
 htop
 tuna
 irqstat
 tuned-adm
Removal of existing bottlenecks
 Multi-queue block layer: http://kernel.dk/blk-mq.pdf
Improved tools
 numatop: https://01.org/numatop
 top: https://gitorious.org/procps/procps (top: added NUMA support)
 irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
 Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/
 NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
22
Appendix. /proc/[number]/numa_maps (since Linux 2.6.14)
 This file displays information about a process's NUMA memory policy and allocation.
 Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and the nodes on which its pages have been allocated.
 numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
23
Appendix. /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
• http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
• http://man7.org/linux/man-pages/man7/numa.7.html
24
Appendix “numactl” Sample
 numactl --physcpubind=+0-4,8-12 myapplic arguments : run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
 numactl --interleave=all bigdatabase arguments : run big database with its memory interleaved across all nodes.
 numactl --cpubind=0 --membind=0,1 process : run process on node 0 with memory allocated on nodes 0 and 1.
 numactl --cpubind=0 --membind=0,1 -- process -l : run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
 numactl --preferred=1 numactl --show : set preferred node 1 and show the resulting state.
 numactl --interleave=all --shmkeyfile /tmp/shmkey : interleave the SysV shared memory region specified by /tmp/shmkey over all nodes.
 numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch : bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
 numactl --localalloc /dev/shm/file : reset the policy for the shared memory file file to the default localalloc policy.
25

More Related Content

What's hot

linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
iman darabi
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
Welly Dian Astika
 
Refining Linux
Refining LinuxRefining Linux
Refining Linux
Jason Murray
 
BKK16-104 sched-freq
BKK16-104 sched-freqBKK16-104 sched-freq
BKK16-104 sched-freq
Linaro
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systems
Joshua Mora
 
HKG15-100: What is Linaro working on - core development lightning talks
HKG15-100:  What is Linaro working on - core development lightning talksHKG15-100:  What is Linaro working on - core development lightning talks
HKG15-100: What is Linaro working on - core development lightning talks
Linaro
 
Linux Troubleshooting
Linux TroubleshootingLinux Troubleshooting
Linux Troubleshooting
Keith Wright
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
Brendan Gregg
 
Linux System Monitoring
Linux System Monitoring Linux System Monitoring
Linux System Monitoring
PriyaTeli
 
Linux System Monitoring basic commands
Linux System Monitoring basic commandsLinux System Monitoring basic commands
Linux System Monitoring basic commands
Mohammad Rafiee
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
Babak Farrokhi
 
System performance monitoring pcp + vector
System performance monitoring   pcp + vectorSystem performance monitoring   pcp + vector
System performance monitoring pcp + vector
Sandeep Kunkunuru
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
Mydbops
 
Measuring directly from cpu hardware performance counters
Measuring directly from cpu  hardware performance countersMeasuring directly from cpu  hardware performance counters
Measuring directly from cpu hardware performance counters
Jean-Philippe BEMPEL
 
Operating Systems - Processor Management
Operating Systems - Processor ManagementOperating Systems - Processor Management
Operating Systems - Processor Management
Damian T. Gordon
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
Hao-Ran Liu
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
Brendan Gregg
 

What's hot (20)

linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
 
Refining Linux
Refining LinuxRefining Linux
Refining Linux
 
BKK16-104 sched-freq
BKK16-104 sched-freqBKK16-104 sched-freq
BKK16-104 sched-freq
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systems
 
Linux monitoring
Linux monitoringLinux monitoring
Linux monitoring
 
HKG15-100: What is Linaro working on - core development lightning talks
HKG15-100:  What is Linaro working on - core development lightning talksHKG15-100:  What is Linaro working on - core development lightning talks
HKG15-100: What is Linaro working on - core development lightning talks
 
Linux Troubleshooting
Linux TroubleshootingLinux Troubleshooting
Linux Troubleshooting
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
Linux System Monitoring
Linux System Monitoring Linux System Monitoring
Linux System Monitoring
 
Linux System Monitoring basic commands
Linux System Monitoring basic commandsLinux System Monitoring basic commands
Linux System Monitoring basic commands
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
System performance monitoring pcp + vector
System performance monitoring   pcp + vectorSystem performance monitoring   pcp + vector
System performance monitoring pcp + vector
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
 
Measuring directly from cpu hardware performance counters
Measuring directly from cpu  hardware performance countersMeasuring directly from cpu  hardware performance counters
Measuring directly from cpu hardware performance counters
 
Operating Systems - Processor Management
Operating Systems - Processor ManagementOperating Systems - Processor Management
Operating Systems - Processor Management
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 

Similar to UNAS-20140123-1800

AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
Amazon Web Services
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
Preeti Chauhan
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
Amazon Web Services
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
Aman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptxAman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptx
vikramkagitapu
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
Amazon Web Services
 
unit_1.pdf
unit_1.pdfunit_1.pdf
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
Amazon Web Services
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
Brendan Gregg
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Amazon Web Services
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
Pavithra S
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
Karunamoorthy B
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
Rajesh Gupta
 
JP Morgan Remote to Core Implementation
JP Morgan Remote to Core ImplementationJP Morgan Remote to Core Implementation
JP Morgan Remote to Core Implementation
John Napier
 

Similar to UNAS-20140123-1800 (20)

AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
AWS re:Invent 2016: Deep Dive on Amazon EC2 Instances, Featuring Performance ...
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
 
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
AWS re:Invent 2016: [JK REPEAT] Deep Dive on Amazon EC2 Instances, Featuring ...
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Aman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptxAman 16 os sheduling algorithm methods.pptx
Aman 16 os sheduling algorithm methods.pptx
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
 
unit_1.pdf
unit_1.pdfunit_1.pdf
unit_1.pdf
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
fall2013
fall2013fall2013
fall2013
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
JP Morgan Remote to Core Implementation
JP Morgan Remote to Core ImplementationJP Morgan Remote to Core Implementation
JP Morgan Remote to Core Implementation
 
Aca 2
Aca 2Aca 2
Aca 2
 

More from Samsung Electronics

Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung Electronics
 
Distributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen PackageDistributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen Package
Samsung Electronics
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940Samsung Electronics
 
kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340Samsung Electronics
 
Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Samsung Electronics
 
booting-booster-final-20160420-0700
booting-booster-final-20160420-0700booting-booster-final-20160420-0700
booting-booster-final-20160420-0700Samsung Electronics
 

More from Samsung Electronics (8)

Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)Samsung ARM Chromebook1/2 (for Hackers & System Developers)
Samsung ARM Chromebook1/2 (for Hackers & System Developers)
 
Distributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen PackageDistributed Build to Speed-up Compilation of Tizen Package
Distributed Build to Speed-up Compilation of Tizen Package
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940
 
kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340kics2013-winter-biomp-slide-20130127-1340
kics2013-winter-biomp-slide-20130127-1340
 
gcce-uapm-slide-20131001-1900
gcce-uapm-slide-20131001-1900gcce-uapm-slide-20131001-1900
gcce-uapm-slide-20131001-1900
 
distcom-short-20140112-1600
distcom-short-20140112-1600distcom-short-20140112-1600
distcom-short-20140112-1600
 
Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900Remote-debugging-based-on-notrace32-20130619-1900
Remote-debugging-based-on-notrace32-20130619-1900
 
booting-booster-final-20160420-0700
booting-booster-final-20160420-0700booting-booster-final-20160420-0700
booting-booster-final-20160420-0700
 

UNAS-20140123-1800

  • 1. 1/23/2014 12:15 PM Geunsik Lim Sungkyunkwan University Samsung Electronics
  • 2. Evolution of Memory Architecture  System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance. UMA (Uniform Memory Architecture) NUMA (Non-Uniform Memory Architecture) CPU CPU CPU CPU Memory Memory C1C0 C2 C3 Memory C1C0 C2 C3 Memory C1C0 C2 C3 Memory C1C0 C2 C3 Interconnection Network Nodes, Sockets, Cores, Threads Operating System Run-time System 2
  • 3. Must end users be NUMA-aware?  Users must be aware of PCIe device slot placement  Optimal NUMA tuning is not yet performed by the OS  Persistent tuning is a non-trivial task  Performance challenges are changing faster than tools Unfortunately, yes. 3
  • 4. Motivation IBM Cell Broadband Engine ccNUMA architecture (SGI Altix 3000) ccNUMA architecture (IBM) 24 Gb/s24 Gb/s  Actually, server administrators have different OS knowledge.  Therefore, all administrators can not manage the NUMA server for the optimizing memory utilization and performance in real environment. 4
  • 5. What is the goal mainly?  Propose the user-space automatic service daemon to provide the best performance, by avoiding unnecessary latency. (For newbies at the server administration)  Binding Processes to NUMA Nodes Automatically in user-space  Automatically Improve NUMA System Performance with the Proposed system  Also, support the manual setting infrastructure like the existing system for the veteran. 5
  • 6. Related work Approach Pros. Cons. Autonuma • Kernel-space • Purely OS approach • Not aggressive approach Numa Balancer • Kernel-space • Purely OS approach • Not aggressive approach Melgorman’s MM • Kernel-space • Purely OS approach • Not aggressive approach Sergey • User-space approach • Aggressive approach • Manual configuration • Damage of memory utilization because of Affinity method • It’s no memory scheduler UNAS • Automatic • User-space approach • Easy to manage • Aggressive approach • Don’t follow-up in-depth Memory management • It’s no memory scheduler 6
  • 7. What is UNAS?  UNAS is a user-space scheduler that monitors NUMA topology and usage  UNAS distributes loads for good locality for the purpose of providing the best performance, by avoiding unnecessary latency.  Goal of UNAS is to automatically bind processes to NUMA nodes as GPL license. Initial allocation New NUMA-aware allocation 7
  • 8. Design User-space Scheduler NUMA List . . . . Monitor Reporter Collect NUMA Specific Data Task ProcFS&SysFS User-Space Scheduler User-Space Runtime Monitor User-space Kernel- space NUMA Memory Node 8
  • 9. Proposed Scheduler Algorithm 1. Monitor: Runtime monitoring mechanism 1. Create a new thread for receiving and dealing with the run-time monitoring data 2. Repeat monitoring until NUMA-aware user-space scheduler stop 3. Sleep for an NUMA specific data (from /proc/stat) 4. Collect the monitoring report 5. End Repeat loop Algorithm 2. Reporter: Collected NUMA specified data reporting mechanism Input: run-time monitoring data 1. Repeat until runtime monitoring mechanism stop 2. Receiving data and filtering them from online monitoring 3. Collect NUMA specific data 4. If loading of system is unbalanced or behavior of the processes changed or powerful core is idle 5. Computing the Run-time speedup factor 6. Sorting the process NUMA list by multi-core speedup factor 7. Computing the contention degradation factor 8. Sorting the process NUMA list by contention degradation factor 9. Sending signal to trigger schedule 10. End if 11. End Repeat loop 9
  • 10. Proposed Scheduler Algorithm 3. User-space Scheduler: Automatic NUMA aware scheduling Input: NUMA list 1. Computing the number of powerful core candidate based on load balanced memory policy 2. Retrieving suitable processes to be scheduled on powerful cores from NUMA list 3. Setting static CPU pin from manual input of administrator 4. If retrieved processes != current processes on powerful cores 5. Migrate the processes 6. End if 7. If current resource contention degradation is too big 8. Scatter the processes with heavy contention 9. Calculating degradation factor in order to minimize resource contention degradation 10. Migrate the processes and the its sticky pages 11. End if 10
  • 11. Flowchart of Proposed Scheduler Monitoring the characteristics of NUMA Setting static CPU pin manually Allocate Memory based on monitoring info. Reallocate for optimal allocation dynamically Per 10 Seconds Manual Setting by Administrator END START New allocation Re-allocation • /proc/<pid>/stat • /proc/<pid>/numa_maps • /sys/class/numa_topology 11
  • 12. Implementation of UNAS Content Default Value Max Nodes 256 Max CPUs 2,048 C P U CPU Threshold 30 CPU Scale Factor 100 Memory Threshold 300 MB Implementation for ( ; ; ) { if (NUMA) { update_processes();  "/proc/%s/stat" interval = manage_loads();  bind_process_and_migrate_memory() time_interval(10); } } invain@numa-server:/proc/2$> cat ./stat 2028 (Xorg) S 1987 2028 2028 1031 2028 4202752 8778 0 41 0 13259 443 0 0 20 0 9 0 2644 238051328 6541 18446744073709551615 1 1 0 0 0 0 0 4096 1367369423 18446744073709551615 0 0 17 53 0 0 30 0 0 Priority invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep heap 7f9219662000 default heap anon=2979 dirty=2979 N0=2 N1=2975 N2=2 invain@numa-server:/proc/2028$ sudo cat ./numa_maps | grep stack 7fffb6601000 default stack anon=37 dirty=37 N1=37 CPU (0~79) #Node of Heap #Node of Stack 12
• 13. Evaluation
  - Server: Dell PowerEdge R910 (40 cores + 40 hyper-threads)
  - CPU: Intel Xeon E7-4850 @ 2.00 GHz (40 cores)
  - Memory: 32 GiB
  - OS: Linux 3.2
  - Platform: Ubuntu 12.04 LTS
  - Benchmarks: PARSEC
• 17. References
   AutoNUMA v26: http://lwn.net/Articles/488709/
   Peter Zijlstra's NUMA scheduling patch set: http://lwn.net/Articles/486858/
   NUMA system calls: get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), and set_mempolicy(2)
   libnuma: link with -lnuma to get the system call definitions. The numactl package is available at ftp://oss.sgi.com/www/projects/libnuma/download/. Applications should not use these system calls directly; the higher-level interface provided by the numa(3) functions in the numactl package is recommended.
   RHEL 6.3: Red Hat Enterprise Linux 6.3; http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/resource_management-tp.html
   Sergey Blagodurov et al., "A Case for NUMA-aware Contention Management on Multicore Systems," USENIX ATC 2011
   Yinan Li et al., "NUMA-Aware Algorithms: the Case of Data Shuffling," CIDR 2013
• 18. Conclusion
   Not all administrators can easily manage a NUMA server to optimize memory utilization and performance in real environments.
   UNAS is a standalone daemon that monitors NUMA topology and usage in real time.
   UNAS distributes load for good locality, with the goal of providing the best performance.
   UNAS automatically binds processes to NUMA nodes, and is released under a Beerware-style license.
• 19. Thank you for your attention. Any questions?
• 20. Backup Slides: In Case We Have More Time…
• 21. Migrating pages to optimize NUMA locality
  GOAL: keep processes and their memory together on the same NUMA node; support automatically migrating pages to optimize NUMA locality.

  NUMASCHED
  - Who: Lee Schermerhorn (HP)
  - Progress: RFC (at LPC 2010)
  - Key factors: lazy/auto-migration
  - Details: migration when a fault handler such as do_swap_page() finds a cached page with zero mappings
  - Operations: automatic page migration for virtualization on x86_64
  - Eval: 1) refer to NUMA Balancer

  SchedNUMA
  - Who: Peter Zijlstra (Red Hat)
  - Progress: PATCH v1 (rewrite of NUMASCHED)
  - Key factors: allowing processes to be put into "NUMA groups" that will share the same home node
  - Details: 1) put processes into NUMA groups: int numa_mbind(); 2) bind to the NUMA group identified by ng_id: int numa_tbind()
  - Operations: new system calls, http://lwn.net/Articles/486850/
  - Eval 1): 55% faster than mainline (Dan Smith)

  AutoNUMA
  - Who: Andrea Arcangeli (Red Hat)
  - Progress: alpha 23
  - Key factors: scanning / auto-migration
  - Details: page-table scanner / per-NUMA-node knuma_migrated queues; migration on fault
  - Operations: git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  - Eval: 35% faster than mainline (Dan Smith)

  Automatic NUMA balancing
  - Who: Mel Gorman (SUSE)
  - Progress: since Linux 3.8
  - Key factors: SchedNUMA + AutoNUMA
  - Details: migration with PTE (Migrate On Reference Of pte_numa Node [MORON])
  - Operations: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git, mm-balancenuma-v4r38
  - Eval: mmtest utility

  Notes:
  - Eval 1): https://lkml.org/lkml/2012/3/20/508
  - AutoNUMA benchmark ver 0.1: git clone git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git
  - mmtest by Mel Gorman: http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz (autonumabench)
• 22. Tools for NUMA Tuning
   numactl, cgroups, taskset, lstopo, dmidecode, sysfs, irqbalance, numad, top, numatop, htop, tuna, irqstat, tuned-adm

  Removal of existing bottlenecks
  - Multi-queue block layer: http://kernel.dk/blk-mq.pdf

  Improved tools
  - numatop: https://01.org/numatop
  - top: https://gitorious.org/procps/procps (top: added NUMA support)
  - irqstat: https://github.com/lanceshelton/irqstat (IRQ viewer for NUMA)
  - Performance profiling methods: http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linuxperformance-checklist/
  - NUMA-aware TCMalloc: http://developer.amd.com/wordpress/media/2013/03/NUMA-aware-TCMalloc.zip
• 23. Appendix: /proc/[number]/numa_maps (since Linux 2.6.14)
   This file displays information about a process's NUMA memory policy and allocation.
   Each line contains information about a memory range used by the process, displaying, among other information, the effective memory policy for that memory range and on which nodes the pages have been allocated.
   numa_maps is a read-only file. When /proc/<pid>/numa_maps is read, the kernel scans the virtual address space of the process and reports how memory is used. One line is displayed for each unique memory range of the process.
• 24. Appendix: /proc/[number]/numa_maps (since Linux 2.6.14), cont'd
  - http://www.kernel.org/doc/man-pages/online/pages/man7/numa.7.html
  - http://man7.org/linux/man-pages/man7/numa.7.html
• 25. Appendix: "numactl" Samples
   numactl --physcpubind=+0-4,8-12 myapplic arguments
    Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
   numactl --interleave=all bigdatabase arguments
    Run big database with its memory interleaved on all CPUs.
   numactl --cpubind=0 --membind=0,1 process
    Run process on node 0 with memory allocated on nodes 0 and 1.
   numactl --cpubind=0 --membind=0,1 -- process -l
    Run process as above, but with an option (-l) that would otherwise be confused with a numactl option.
   numactl --preferred=1 numactl --show
    Set preferred node 1 and show the resulting state.
   numactl --interleave=all --shmkeyfile /tmp/shmkey
    Interleave all of the SysV shared memory region specified by /tmp/shmkey over all nodes.
   numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
    Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
   numactl --localalloc /dev/shm/file
    Reset the policy for the shared memory file file to the default localalloc policy.
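The CPU-pinning half of these numactl examples (--physcpubind) can also be done from code: on Linux, Python's standard-library os.sched_setaffinity wraps the same affinity interface that numactl and taskset use. This is only a sketch of the pinning side; memory policy (--membind, --interleave) still requires libnuma or numactl. The helper name is our own:

```python
# Pin the calling process to a CPU set, analogous to
# "numactl --physcpubind=...". Linux-only (os.sched_setaffinity).
import os

def pin_to_cpus(cpus):
    """Restrict this process to the given CPUs; return the resulting mask."""
    allowed = os.sched_getaffinity(0) & set(cpus)
    if allowed:                       # never pin to CPUs we cannot run on
        os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)

original = os.sched_getaffinity(0)    # remember the original mask
print(pin_to_cpus({0}))               # CPU 0 exists on any Linux machine
os.sched_setaffinity(0, original)     # restore the original mask
```

numactl remains the right tool interactively; an affinity call like this is what a daemon such as UNAS would use when it decides to migrate a process.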