A short and very cursory look at some of the features that make modern (x86) CPUs "modern".
I wished to include more examples, time comparisons and more detailed information, but the time allotted to the presentation barely allowed even this.
This was the first time I was presenting the subject, so expect much roughness around the edges.
Also, if you are even remotely interested in modern CPUs and caches and whatnot, don't look at this; Google for Cliff Click's excellent talk "A Crash Course in Modern Hardware".
The document describes a presentation on Oracle Performance Measurement and Tuning with Solaris DTrace. It discusses using DTrace and BTrace to dynamically instrument applications like WebLogic Server. A key benefit is reducing the "observer effect" by using low overhead tools that allow probes to be enabled and disabled dynamically.
The document provides troubleshooting steps for resolving common problems during the installation of virtual machines. It describes solutions for errors related to mounting an NFS share, starting a domain, permissions issues preventing log files from being read, and network interfaces having duplicate MAC addresses. The solutions include installing NFS utilities, editing the /etc/hosts file, changing file ownership and permissions, restarting domains and networks, and checking firewall and Apache configurations.
The document discusses using netlink and netlink families to enable communication between the kernel and user space processes. Netlink allows a more flexible alternative to ioctl calls by providing a socket-based interface for exchanging information. Various netlink families exist for communicating with different kernel modules, including NETLINK_ROUTE for routing functions. The document then discusses using netlink and the arpd tool to dynamically update ARP and bridge forwarding database tables in response to layer 2 and layer 3 miss events. This allows maintaining overlay networks by handling ARP requests between network namespaces.
The document provides an overview of tips and best practices for installing and configuring Linux on IBM System z mainframes and virtual machines. It covers topics such as bootstrapping options, DASD layouts, init scripts, networking configuration, filesystem types, replication techniques, and interacting with the hypervisor. The document is intended as a reference for both new and experienced users of Linux on System z.
The document discusses tips for malloc and free in C, including making your own malloc library for troubleshooting. It covers system calls like brk/sbrk and mmap/munmap that are used to allocate memory in user space. It also provides tips for the glibc malloc implementation, such as functions like mallopt, malloc_stats, and malloc_usable_size. Finally, it discusses two methods for hooking and replacing malloc - using LD_PRELOAD and dlsym, or the __malloc_hook mechanism.
Make container without_docker_6-overlay-network_1, by Sam Kim
How does communication between containers work in a distributed environment? In parts 3 and 4 we built a virtual network inside a single host. In part 6, building on that, we make hosts in a distributed environment communicate with each other over a virtual network. This covers the vxlan-based overlay network setup actually used by Kubernetes CNIs such as flannel.
This document discusses various security concepts in FreeBSD including file system protections like flags, access control lists, kernel security levels, jails, and techniques to harden the operating system like write xor execute (W^X) and cryptography. It provides details on file flags like immutable, append-only, and nodump and how they can be used to secure files. It also explains how to create and manage jails to isolate processes and provides an overview of tools to help administer jails.
This document discusses setting up a network bridge without Docker. It provides a Vagrantfile to configure a virtual machine environment with Ubuntu 18.04, along with tools like Go and Docker installed. Instructions are given to create a bridge between two network namespaces called RED and BLUE using IP addresses in the 11.11.11.0/24 range. Tests show that hosts can ping each other within this network but not across the real interface and IP range of the host machine. Additional routing and IP configuration is needed to allow outside communication.
This document discusses using Docker containers without Docker. It provides a Vagrantfile configuration to set up a virtual machine environment for experiments. The Vagrantfile configures an Ubuntu 18.04 virtual machine with Docker, Go, and other tools installed. The document then covers mount namespaces and how to isolate the root filesystem of a process to emulate containers without Docker.
This document provides instructions for setting up FreeBSD jails with a shared read-only template and individual read-write partitions for each jail. It describes creating a master template with installed binaries and ports, then creating directories for each jail mounted via nullfs. Individual jail configurations are added to rc.conf and the jails are started and can be managed via jexec. Upgrades involve building a new template and restarting the jails.
Install tomcat 5.5 in debian os and deploy war file, by Nguyen Cao Hung
This document provides instructions for installing and configuring Tomcat and PostgreSQL on a Debian server. It includes steps to install Tomcat 5 and 7, add users, configure Java version 1.7, copy files to the server, configure properties, recompile for the Debian environment, deploy WAR files, test, and check services. It also provides commands for managing files, directories, and users in Linux.
Logical Volume Management ("LVM") on Linux looks like a complicated mess at first. The basics are not all that hard, and features like mirroring, dynamic space management, snapshots for stable backups, and over-provisioning via thin volumes can save a lot of time and effort.
This document discusses using Btrfs and Snapper to enable full system rollbacks in Linux. It describes how snapshots are automatically created to capture the state of the system before changes. Using Snapper, administrators can rollback the entire system to a previous snapshot to undo changes or revert to a known good state. The document provides examples of rolling back packages, kernels and system configuration changes while ensuring system integrity and compliance.
SnapDiff is NetApp's proprietary indexing engine that identifies file differences between two snapshots. It compares the base and diff snapshots using inode-walk and returns a list of added, deleted, modified, and renamed files. Backup software vendors can integrate their solutions with SnapDiff through its API to perform incremental backups using snapshots. SnapDiff provides faster indexing compared to traditional file system crawlers or NDMP backups by identifying file changes at the inode level rather than performing a full tree walk.
The document discusses ioremap and mmap functions in Linux for mapping physical addresses into the virtual address space. Ioremap is used when physical addresses are larger than the virtual address space size. It maps physical addresses to virtual addresses that can be accessed by the CPU. Mmap allows a process to map pages of a file into virtual memory. It is useful for reducing memory copies and improving performance of file read/write operations. The document outlines the functions, flags, and flows of ioremap, mmap, and implementing a custom mmap file operation for direct physical memory mapping.
The document discusses several key topics about the FreeBSD operating system including:
- How to use the virtual consoles of FreeBSD and log into the system.
- An overview of UNIX file permissions and flags in FreeBSD.
- The default directory structure and disk organization of FreeBSD.
- How to mount and unmount file systems using the fstab file and mount/unmount commands.
- Concepts of processes, daemons, signals and killing processes.
- What shells are and how to change your default login shell.
This document discusses experimenting with cgroups in Docker containers to isolate processes. It describes installing Docker, launching an Ubuntu container with capabilities enabled, and installing cgroup tools. It then mounts the cpuset and cpu cgroup hierarchies and creates low and high cgroups. Different CPU shares are assigned to each cgroup and processes are run in each to demonstrate the CPU isolation between cgroups.
The document provides an overview of the FreeBSD operating system boot process and file hierarchy. It discusses how the boot0 and boot1 programs initialize the system and load the boot2 program. Boot2 then loads the loader, which uses a bootinfo structure to pass information to the kernel and load it into memory to start the operating system. The document also describes the standard FreeBSD file hierarchy and the purpose of the main directories like /bin, /sbin, /etc, /usr, /var, and /boot.
Lecture 7 of Introduction to Quantum Chemical Simulation, a graduate course taught at MIT in Fall 2014 by Heather Kulik. The course covers wavefunction theory, density functional theory, force fields, molecular dynamics and sampling.
The document provides an overview of User Mode Linux (UML), including what it is, how it works, alternatives, and how to use it. UML allows running the Linux kernel as a userspace process, enabling uses like kernel debugging, security testing, and hosting virtual servers. It works by giving guest kernels and their processes separate address spaces on the host, using ptrace-based tracing rather than hardware virtualization. Key components discussed include filesystems, networking using TUN/TAP devices, management scripts, backups using LVM snapshots and blocksync, and network monitoring tools like MRTG and iftop.
This is an old presentation salvaged from archive.
http://tree.celinuxforum.org/CelfPubWiki/JapanTechnicalJamboree13
This is the English-translated version. The Japanese version is here:
http://www.slideshare.net/tetsu.koba/basic-of-virtual-memory-of-linux
VastSky is a cluster storage system that creates logical volumes across multiple servers and disks for virtual machines to use. It is scalable and highly available, and provides good performance through load balancing of I/O across physical disks. The code is now open source, and it aims to integrate with Xen/XCP virtualization software to provide shared storage for virtual machines. Further improvements are planned to enhance scalability, add snapshot capabilities, and support active-active clustering configurations.
This document discusses using memory-mapped files to create a ring buffer for logging in an efficient way that retains the last log even if a process crashes. It describes mapping a file to shared memory using mmap(), filling the file, and writing logs to the ring buffer in shared memory. If the process crashes, the last log remains in the file. It addresses issues like synchronizing memory and file, and handling process forking so child processes don't overwrite the same log file.
This document describes how to configure EMC NetWorker backup server to perform NDMP backups of a Solaris system using the Solaris SUNWndmp client. Key steps include installing the SUNWndmp packages on the Solaris client, configuring the NDMP service, creating an NDMP device resource in NetWorker for the tape device, and ensuring the NDMP password is set correctly in NetWorker. This allows NetWorker to initiate NDMP backups of the Solaris client to the tape device via the SUNWndmp NDMP daemon without requiring a NetWorker agent on the Solaris system.
The "Namespaces, Cgroups and systemd" document discusses:
1. Namespaces and cgroups which provide isolation and resource management capabilities in Linux.
2. Systemd which is a system and service manager that aims to boot faster and improve dependencies between services.
3. Key components of systemd include unit files, systemctl, and tools to manage services, devices, mounts and other resources.
Creating Secure VM (Comparison between Intel and AMD, and one more thing...) -..., by Tsukasa Oi
This document discusses creating secure virtual machines through techniques like setting breakpoints using debug registers or page table modifications. It compares Intel and AMD virtualization technologies, specifically how AMD-V can intercept the IRET instruction while both support using debug registers or page tables for breakpoints. Full virtualization of x86 on x86_64 architectures is also discussed as a way to do instruction tracing for purposes like malware analysis and reverse engineering. Limitations include supporting x86 segmentation and needing very fast storage for tracing large amounts of data.
This document discusses real-time operating systems (RTOS). It defines an RTOS as a multitasking OS that meets time deadlines and functions in real-time constraints. The document outlines RTOS architecture, including the kernel that provides abstraction between software and hardware. It also discusses RTOS features like tasks, scheduling, timers, memory management, and inter-task communication methods. Examples of RTOS applications include medical devices, aircraft control systems, and automotive components.
[CCC-28c3] Post Memory Corruption Memory Analysis, by Moabi.com
The document summarizes the Post Memory Corruption Memory Analysis (PMCMA) tool. PMCMA allows finding and testing exploitation scenarios resulting from invalid memory accesses. It provides a roadmap to exploitation without generating exploit code. The tool analyzes programs after crashes to overwrite memory locations in forked processes and test impact on execution flow.
Karl Grzeszczak: September Docker Presentation at Mediafly, by Mediafly
This document provides an overview of CoreOS, an open source operating system designed for clusters and distributed systems. CoreOS is lightweight, uses Docker containers, automatically updates in a way that is quick and reliable, and has tools like etcd for service discovery and fleet for orchestrating containers across a cluster. The document includes code examples of setting up a CoreOS cluster on Vagrant and using fleet to launch and manage containers.
This presentation introduces Data Plane Development Kit overview and basics. It is a part of a Network Programming Series.
First, the presentation focuses on the network performance challenges on the modern systems by comparing modern CPUs with modern 10 Gbps ethernet links. Then it touches memory hierarchy and kernel bottlenecks.
The following part explains the main DPDK techniques, like polling, bursts, hugepages and multicore processing.
The DPDK overview explains how a DPDK application is initialized and run, and touches on lockless queues (rte_ring), memory pools (rte_mempool), memory buffers (rte_mbuf), hashes (rte_hash), cuckoo hashing, the longest prefix match library (rte_lpm), poll mode drivers (PMDs) and the kernel NIC interface (KNI).
At the end, there are a few DPDK performance tips.
Tags: access time, burst, cache, dpdk, driver, ethernet, hub, hugepage, ip, kernel, lcore, linux, memory, pmd, polling, rss, softswitch, switch, userspace, xeon
We created a Redis container from the Redis image and ran it in detached mode. We then ran another container interactively to connect to the Redis server using the host IP and exposed port. Docker creates containers with isolated filesystems, network stacks, and process spaces from images. When a container starts, Docker joins the container process to the necessary namespaces to isolate it and sets up the network and filesystem.
Modern CPUs use various techniques to improve performance such as instruction pipelining, cache memory, superscalar execution, out-of-order execution, speculative execution, and branch prediction. However, these optimizations can introduce security vulnerabilities like Spectre and Meltdown attacks which exploit side effects of speculative execution in the CPU cache to leak secret data from memory. Speculative execution may process instructions early before branch resolution, potentially loading secret data into the cache where an attacker can detect it using precise timing measurements. While fixes have been developed, fully mitigating these issues remains an ongoing challenge for CPU architecture.
- Galera is a MySQL clustering solution that provides true multi-master replication with synchronous replication and no single point of failure.
- It allows high availability, data integrity, and elastic scaling of databases across multiple nodes.
- Companies like Percona and MariaDB have integrated Galera to provide highly available database clusters.
Slackware Demystified provides an overview of the Slackware Linux distribution. It discusses Slackware's philosophy of keeping things simple and sticking close to upstream. It describes Slackware's init system, configuration files, package structure, and community support. The presentation emphasizes Slackware's minimalist approach and encourages learning through documentation rather than abstracted interfaces.
Pacemaker is a high availability cluster resource manager that can be used to provide high availability for MySQL databases. It monitors MySQL instances and replicates data between nodes using replication. If the primary MySQL node fails, Pacemaker detects the failure and fails over to the secondary node, bringing the MySQL service back online without downtime. Pacemaker manages shared storage and virtual IP failover to ensure connections are direct to the active MySQL node. It is important to monitor replication state and lag to ensure data consistency between nodes.
- The document discusses Linux network stack monitoring and configuration. It begins with definitions of key concepts like RSS, RPS, RFS, LRO, GRO, DCA, XDP and BPF.
- It then provides an overview of how the network stack works from the hardware interrupts and driver level up through routing, TCP/IP and to the socket level.
- Monitoring tools like ethtool, ftrace and /proc/interrupts are described for viewing hardware statistics, software stack traces and interrupt information.
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ..., by Spark Summit
Spark is by its nature very fault tolerant. However, faults, and application failures, can and do happen, in production at scale.
In this talk, we’ll discuss the nuts and bolts of fault tolerance in Spark.
We will begin with a brief overview of the sorts of fault tolerance offered, and lead into a deep dive of the internals of fault tolerance. This will include a discussion of Spark on YARN, scheduling, and resource allocation.
We will then spend some time on a case study and discussing some tools used to find and verify fault tolerance issues. Our case study comes from a customer who experienced an application outage that was root caused to a scheduler bug. We discuss the analysis we did to reach this conclusion and the work that we did to reproduce it locally. We highlight some of the techniques used to simulate faults and find bugs.
At the end, we’ll discuss some future directions for fault tolerance improvements in Spark, such as scheduler and checkpointing changes.
Direct Code Execution - LinuxCon Japan 2014, by Hajime Tazaki
Direct Code Execution (DCE) is a userspace kernel network stack that allows running real network stack code in a single process. DCE provides a testing platform that enables reproducible testing, fine-grained parameter tuning, and a development framework for network protocols. It achieves this through a virtualization core layer that runs multiple network nodes within a single process, a kernel layer that replaces the kernel with a shared library, and a POSIX layer that redirects system calls to the kernel library. This allows full control and observability for testing and debugging the network stack.
I am Anne L. I am an Operating System Assignment Expert at programminghomeworkhelp.com. I hold a Ph.D. in Programming, Auburn University, USA. I have been helping students with their homework for the past 8 years. I solve assignments related to Operating systems.
Visit programminghomeworkhelp.com or email support@programminghomeworkhelp.com.
You can also call on +1 678 648 4277 for any assistance with Operating System Assignments.
This document discusses the Java Memory Model (JMM). It begins by introducing the goals of familiarizing the attendee with the JMM, how processors work, and how the Java compiler and JVM work. It then covers key topics like data races, synchronization, atomicity, and examples. The document provides examples of correctly synchronized programs versus programs with data races. It explains concepts like happens-before ordering, volatile variables, and atomic operations. It also discusses weaknesses in some common multi-threading constructs like double-checked locking and discusses how constructs like final fields can enable safe publication of shared objects. The document concludes by mentioning planned improvements to the JMM in JEP 188.
NUSE is a library implementation of a network stack in userspace that allows new protocols and implementations to be added more quickly without modifying the kernel. It works by hijacking system calls related to networking at the library level, running the network stack code in a separate execution context using lightweight virtualization, and connecting to the network interface using options like raw sockets, DPDK, or netmap. This approach avoids the slow evolution of making kernel changes and allows network stacks and applications to be updated and deployed more flexibly on a per-application basis.
Windows 2000 is a 32-bit operating system designed for compatibility, reliability, and performance. It includes several key components like the kernel, executive services, and environmental subsystems. The kernel schedules threads and handles exceptions/interrupts. Executive services include the object manager, virtual memory manager, process manager, and I/O manager. Environmental subsystems allow running applications from other operating systems. The document also discusses disk structure, file systems, networking, and other OS concepts.
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayPhil Estes
A talk given at Open Container Day at O'Reilly's OSCON convention in Austin, Texas on May 9th, 2017. This talk describes an open source project, bucketbench, which can be used to compare performance, stability, and throughput of various container engines. Bucketbench currently supports docker, containerd, and runc, but can be extended to support any container runtime. This work was done in response to performance investigations by the Apache OpenWhisk team in using containers as the execution vehicle for functions in their "Functions-as-a-Service" runtime. Find out more about bucketbench here: https://github.com/estesp/bucketbench
2. Some notes about the subject
CPUs and their gimmicks
Caches and their importance
How CPU and OS handle memory logically
http://yaserzt.com/blog/ 2
3. These are very complex subjects
Expect very few details; much simplification, generalization and omission
No time
Even a full course would be hilariously insufficient
Not an expert
Sorry! Can’t help much.
Just a pile of loosely related stuff
4. Pressure for performance
Backwards compatibility
Cost/power/etc.
The ridiculous “numbers game”
Law of diminishing returns
Latency vs. Throughput
5. You can always solve your bandwidth (throughput) problems with money, but it is rarely so for lag (latency).
Relative rate of improvements (from David
Patterson’s keynote, HPEC 2004)
CPU, 80286 till Pentium 4: 21x vs. 2250x
Ethernet, 10Mb till 10Gb: 16x vs. 1000x
Disk, 3600 till 15000rpm: 8x vs. 143x
DRAM, plain till DDR: 4x vs. 120x
6. At the simplest level, the von Neumann
model stipulates:
Program is data and is stored in memory along
with data (departing from Turing’s model)
Program is executed sequentially
Not the way computers function anymore…
Abstraction still used for thinking about programs
But it’s leaky as heck!
“Not Your Fathers’ von Neumann Machine!”
7. Speed of Light: can’t send and receive signals to
and from all parts of the die in a cycle anymore
Power: more transistors leads to more power,
which leads to much more heat
Memory: the CPU isn’t even close to the
bottleneck anymore. “All your base are belong
to” memory
Complexity: adding more transistors for more
sophisticated operation won’t give much of a
speedup (e.g. doubling transistors might give
2%.)
8. Family introduced with 8086 in 1978
Today, new members are still fully binary
backward-compatible with that puny machine
(5MHz clock, 20-bit addressing, 16-bit regs.)
It had very few registers
It had segmented memory addressing (joy!)
It had many complex instructions and several
addressing modes
10. Registers got expanded from (all 16-bit, not really general purpose)
AX, BX, CX, DX
SI, DI, BP, SP
CS, DS, ES, SS, Flags, IP
To
16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI,
R8-R15) plus RIP and Flags and others
16 x 128-bit XMM regs. (XMM0-...)
▪ Or 16 x 256-bit YMM regs. (YMM0-...)
More than a thousand logically different instructions (the
usual, plus string processing, cryptography, CRC, complex
numbers, etc.)
11. The Fetch-Decode-Execute-Retire Cycle
Strategies for more performance:
More complex instructions, doing more in
hardware (CISCing things up)
Faster CPU clock rates (the free lunch)
Instruction-Level Parallelism (SIMD + gimmicks)
Adding cores (free lunch is over!)
And then, there are gimmicks…
13. Classic sequential execution:
Lengths of instruction executions vary a lot (5-10× is usual; several orders of magnitude also happen.)
Instruction 1
Instruction 2
Instruction 3
Instruction 4
14. It’s really more like this for the CPU:
Instructions may have many sub-parts, and they
engage different parts of the CPU
F1 D1 E1 R1
F2 D2 E2 R2
F3 D3 E3 R3
F4 D4 E4 R4
15. So why not do this:
This is called “pipelining”
It increases throughput (significantly)
Doesn’t decrease latency for single instructions
F1 D1 E1 R1
F2 D2 E2 R2
F3 D3 E3 R3
F4 D4 E4 R4
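The throughput-vs-latency point above can be sketched with a toy cycle count. This is a hypothetical idealized model (4 stages, no stalls or hazards); the function names are mine, not from the slides.

```python
def unpipelined_cycles(n_instr, stages=4):
    # each instruction occupies the whole CPU for all of its stages
    return n_instr * stages

def pipelined_cycles(n_instr, stages=4):
    # the first instruction needs `stages` cycles to fill the pipe;
    # after that, one instruction retires every cycle
    return stages + (n_instr - 1)

print(unpipelined_cycles(1000))  # 4000
print(pipelined_cycles(1000))    # 1003: ~4x throughput, still 4-cycle latency each
```

Note that for a single instruction both models give 4 cycles: pipelining never shortens any one instruction, it only overlaps them.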
16. But it has its own share of problems
Hazards, stalls, flushing, etc.
Execution of i2 depends on the result of i1
After i2, we jump, and i3, i4, … are flushed out
F1 D1 E1 R1 add EAX,120
F2 D2 E2 R2 jmp [EAX]
F3 D3 E3 R3 mov [4*EBX+42],EDX
F4 D4 E4 R4 add ECX,[EAX]
17. Instructions are broken up into simple,
orthogonal µ-ops
mov EAX,EDX might generate only one µ-op
mov EAX,[EDX] might generate two:
1. µld tmp0,[EDX]
2. µmov EAX,tmp0
add [EAX],EDX probably generates three:
1. µld tmp0,[EAX]
2. µadd tmp0,EDX
3. µst [EAX],tmp0
18. The CPU, then, gets two layers:
The one that breaks up operations into µ-ops
The one that executes µ-ops
The part that executes µ-ops can be simpler
(more RISCy) and therefore faster.
More complex instructions can be supported
without (much) complicating the CPU
The pipelining (and other gimmicks) can
happen at the µ-op level
19. CPUs that issue (or retire) more than one
instruction per cycle are called Superscalar
Can be thought of as a pipeline with more
than one line
Simplest form: integer pipe plus floating-point
pipe
These days, CPUs do 4 or more
Obviously requires more of each type of
operational unit in the CPU
20. To prevent your pipeline from stalling as
much as possible, issue the next instructions
even if you can’t start the current one.
But of course, only if there are no hazards
(dependencies) and there are operational
units available.
add RAX,RAX
add RAX,RBX
add RCX,RDX    ; this can be, and is, started before the previous instruction
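A toy scoreboard shows why that independent add can issue immediately. This is a deliberate simplification (it ignores issue width, functional-unit limits, and memory); the `schedule` helper and register-tuple encoding are my own illustration, not a real scheduler.

```python
def schedule(instrs):
    # instrs: list of (dest_reg, source_regs, latency_in_cycles)
    ready = {}                 # register -> cycle its value becomes available
    starts = []
    for dest, srcs, lat in instrs:
        t = max([ready.get(s, 0) for s in srcs], default=0)
        starts.append(t)       # earliest cycle this instruction can issue
        ready[dest] = t + lat
    return starts

prog = [("RAX", ["RAX"], 1),          # add RAX,RAX
        ("RAX", ["RAX", "RBX"], 1),   # add RAX,RBX -- depends on the first add
        ("RCX", ["RCX", "RDX"], 1)]   # add RCX,RDX -- fully independent
print(schedule(prog))  # [0, 1, 0]: the third add issues in cycle 0
```

The second add must wait one cycle for its RAX input; the third has no hazard and starts alongside the first.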
21. This obviously also applies at the µ-op level:
mov RAX,[mem0]
mul RAX,42
add RAX,[mem1]    ; fetching mem1 is started long before the result of the multiply becomes available
push RAX
call Func
Pushing RAX is sub RSP,8 and then mov [RSP],RAX. Since the call instruction needs RSP too, it will only wait for the subtraction, and not the store, to finish before starting.
22. Consider this:
mov RAX,[mem0]
mul RAX,42
mov [mem1],RAX
mov RAX,[mem2]
add RAX,7
mov [mem3],RAX
Logically, the two parts are totally separate.
However, the use of RAX will stall the pipeline.
23. Modern CPUs have a lot of temporary,
unnamed registers at their disposal.
They will detect the logical independence,
and will use one of those in the second block instead of RAX.
And they will track which reg. is which, where.
In effect, they are renaming another register
to RAX.
There might not even be a real RAX!
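Renaming can be sketched as a mapping from architectural registers to an endless supply of temporaries. A minimal sketch, assuming an unbounded register pool (real hardware recycles a finite one); `rename` and the tuple encoding of instructions are hypothetical.

```python
def rename(instrs):
    # instrs: list of (dest, sources); every write gets a fresh temporary
    mapping, out = {}, []
    for i, (dest, srcs) in enumerate(instrs, start=1):
        srcs = [mapping.get(s, s) for s in srcs]   # read the current names
        mapping[dest] = f"t{i}"                    # fresh name for each write
        out.append((mapping[dest], srcs))
    return out

# the two logically separate blocks from the earlier slide, both using RAX
# (store targets are treated as plain destinations to keep the toy simple):
prog = [("RAX", ["mem0"]), ("RAX", ["RAX"]), ("mem1", ["RAX"]),
        ("RAX", ["mem2"]), ("RAX", ["RAX"]), ("mem3", ["RAX"])]
for dest, srcs in rename(prog):
    print(dest, srcs)
```

After renaming, the second block lives entirely in t4/t5/t6 and never touches t1/t2/t3, so the false dependence through RAX is gone and the two chains can overlap.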
24. This is, for once, simpler than it might seem!
Every time a register is assigned to, a new
temporary register is used in its stead.
Consider this:
mov RAX,[cached]      ; rename happens
mov RBX,[uncached]
add RBX,RAX
mul RAX,42            ; renaming on the mul means it won't clobber RAX
mov [mem0],RAX
mov [mem1],RBX
The add needs the old RAX but is waiting on the load of [uncached]; thanks to renaming, the multiply can proceed anyway and we reach the first store much sooner.
25. The CPU always depends on knowing where
the next instruction is, so it can go ahead and
work on it.
That’s why branches in code are anathema to
modern, deep pipelines and all the gimmicks
they pull.
If only the CPU could somehow guess where
the target of each branch is going to be…
That’s where branch prediction comes in.
26. So the CPU guesses the target of a jump (if it
doesn’t know for sure,) and continues to
speculatively execute instructions from there.
For a conditional jump, the CPU must also
predict whether the branch is taken or not.
If the CPU is right, the pipeline flows
smoothly. If not, the pipeline must be flushed
and much time and resource is wasted on a
misprediction.
27. In this code:
cmp RAX,0
jne [RBX]
both the target and whether the jump happens
or not must be predicted.
The above can effectively jump anywhere!
But usually branches are closer to this:
cmp RAX,0
jne somewhere_specific
Which can only have two possible targets.
28. In a simple form, when a branch is executed,
its target is stored in a table called the BTB (or
Branch Target Buffer.) When that branch is
encountered again, the target address is
predicted to be the value read from the BTB.
As you might guess, this doesn’t work for
many situations (e.g. alternating branch.)
Also, the size of the BTB is limited, so CPU will
forget about the last target of some jumps.
29. A simple expansion on the previous idea is to use a
saturating counter along with each entry of the BTB.
For example, with a 2-bit counter,
Branch is predicted not to be taken if the counter is 0 or 1.
The branch is predicted to be taken if the counter is 2 or 3.
Each time it is taken, counter is incremented, and vice versa.
(State diagram: Strongly Not Taken ↔ Weakly Not Taken ↔ Weakly Taken ↔ Strongly Taken; each taken outcome moves one state toward Strongly Taken, each not-taken one state back.)
30. But this behaves very badly in common situations.
For an alternating branch,
If the counter starts in 00 or 11, it will mispredict 50%.
If the counter starts in 01 and the branch is taken the first time, it will mispredict 100%!
As an improvement, we can store the history of
the last N occurrences of the branch in the BTB,
and use 2N counters for each of the possible
history patterns.
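Both mispredict rates claimed above can be checked with a small simulation of a single 2-bit saturating counter (states 0-3; 2 and 3 predict taken). The helper name is mine; the state machine is the one the slides describe.

```python
def mispredicts_2bit(outcomes, counter):
    # counter in 0..3; 0/1 predict not-taken, 2/3 predict taken
    wrong = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            wrong += 1
        # saturating update: increment on taken, decrement on not-taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return wrong

alternating = [True, False] * 50          # taken, not taken, taken, ...
print(mispredicts_2bit(alternating, 0))   # 50  -> 50% from state 00
print(mispredicts_2bit(alternating, 3))   # 50  -> 50% from state 11
print(mispredicts_2bit(alternating, 1))   # 100 -> every single prediction wrong
```

Starting in 01 with the branch taken first, the counter bounces between 1 and 2 exactly out of phase with the branch, which is the 100% pathological case from the slide.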
31. For N=4 and 2-bit counters, we’ll have:
(Table: each 4-bit branch history pattern, e.g. 0010, indexes its own 2-bit counter, whose state gives the prediction — 0 or 1.)
This is an extremely cool method of doing branch prediction!
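With per-history counters, the same alternating branch that defeated a single counter becomes almost perfectly predictable. A sketch of the two-level scheme with N=4 history bits, all counters hypothetically initialized to weakly-not-taken:

```python
def mispredicts_two_level(outcomes, n=4):
    counters = [1] * (2 ** n)   # one 2-bit counter per possible history pattern
    history, wrong = 0, 0
    for taken in outcomes:
        c = counters[history]
        if (c >= 2) != taken:
            wrong += 1
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the actual outcome into the n-bit history register
        history = ((history << 1) | taken) & (2 ** n - 1)
    return wrong

alternating = [True, False] * 50
print(mispredicts_two_level(alternating))  # 3: only warm-up mispredictions
```

After a few iterations the histories 0101 and 1010 each settle their own counter, so every subsequent occurrence of the alternating pattern is predicted correctly.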
32. Some predictions are simpler:
For each ret instruction, the target is somewhere
on the stack (pushed before.) Modern CPUs keep
track of return addresses in an internal return
stack buffer. Each time a call is executed, an
entry is added and is used for the return address.
On a cold encounter (a.k.a. static prediction) a
branch is sometimes predicted to
▪ fall through if it goes forward.
▪ be taken if it goes backward.
33. Best general advice is to arrange your code so
that the most common path for branches is
“not taken”. This improves the effectiveness
of code prefetching and the trace cache.
Branch prediction, register renaming and
speculative execution work extremely well
together.
35. Clock 0 – Instruction 0
mov RAX,[RBX+16]    ; load RAX from memory; assume a cache miss – 300 cycles to load
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
The instruction starts and dispatch continues...
36. Clock 0 – Instruction 1
mov RAX,[RBX+16]
add RBX,16          ; writes RBX, which conflicts with the read in instruction 0
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Rename this instance of RBX and continue...
37. Clock 0 – Instruction 2
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0           ; value of RAX not available yet; cannot calculate the Flags reg.
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Queue up behind instruction 0...
38. Clock 0 – Instruction 3
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull           ; Flags reg. still not available; predict that this branch is not taken
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Assuming 4-wide dispatch, the instruction issue limit for this cycle is reached.
39. Clock 1 – Instruction 4
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX    ; store is speculative; result kept in the Store Buffer
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Also, RBX might not be available yet (from instruction 1). The Load/Store Unit is tied up from now on; can’t issue any more memory ops in this cycle.
40. Clock 2 – Instruction 5
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]     ; had to wait for the L/S Unit; assume another (unrelated) cache miss
mov RAX,[RAX+8]
We have 2 overlapping cache misses now. The L/S Unit is busy again.
41. Clock 3 – Instruction 6
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]     ; RAX is not ready yet (300-cycle latency, remember?!)
This load cannot even start until instruction 0 is done.
42. Clock 301 – Instruction 2
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0           ; at clock 300 (or 301), RAX is finally ready
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Do the comparison and update the Flags register.
43. Clock 301 – Instruction 6
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]     ; issue this load too; assume a cache hit (finally!)
The result will be available in clock 304.
44. Clock 302 – Instruction 3
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull           ; now the Flags reg. is ready; check the prediction
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
Assume the prediction was correct.
45. Clock 302 – Instruction 4
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX    ; this speculative store can actually be committed to memory (or cache, actually)
mov RCX,[RDX+0]
mov RAX,[RAX+8]
46. Clock 302 – Instruction 5
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]     ; at clock 302, the result of this load arrives
mov RAX,[RAX+8]
47. Clock 305 – Instruction 6
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]     ; result arrived at clock 304; instruction retired at 305
48. To summarize,
mov RAX,[RBX+16]
add RBX,16
cmp RAX,0
je IsNull
mov [RBX-16],RCX
mov RCX,[RDX+0]
mov RAX,[RAX+8]
▪ In 4 clocks, started 7 ops and 2 cache misses
▪ Retired 7 ops in 306 cycles
▪ Cache misses totally dominate performance
▪ The only real benefit came from being able to have 2 overlapping cache misses!
49. To get to the next cache miss as early as possible.
50. Main memory is slow; S.L.O.W.
Very slow
Painfully slow
And it especially has very bad (high) latency
But all is not lost! Many (most) references to
memory have high temporal and address locality.
So we use a small amount of very fast memory to
keep recently-accessed or likely-to-be-accessed
chunks of main memory close to CPU.
51. Caches typically come in several levels (3 these days.)
Each lower level is several times smaller, but
several times faster than the level above.
CPU can only see the L1 cache, each level only
sees the level above, and only the highest
level can communicate with main memory.
Data is transferred between memory and
cache in units of fixed size, called a cache line.
The most common size today is 64 bytes.
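The line granularity means spatial locality is essentially free bandwidth. A minimal sketch, assuming the 64-byte line size mentioned above (the helper name is mine):

```python
LINE_SIZE = 64   # bytes per cache line (the common size today)

def line_of(addr):
    # all bytes in the same aligned 64-byte block share one cache line
    return addr & ~(LINE_SIZE - 1)

# walking 1 KiB byte-by-byte pulls in only 16 distinct lines:
touched = {line_of(a) for a in range(1024)}
print(len(touched))                              # 16
print(line_of(100), line_of(127), line_of(128))  # 64 64 128
```

So after one miss brings in the line at address 64, the next 63 sequential byte accesses are all hits; address 128 starts a new line.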
52. When any memory byte is needed, its place in cache is calculated;
CPU asks the cache;
If there, the cache returns the data;
If not, the data is pulled in from memory;
If the calculated cache line is occupied by data with a different tag, that data is evicted;
If the line is dirty (modified) it is written back to memory first.
(Diagram: main memory, where each block is the size of a cache line; the cache, where each block also holds metadata like the tag (address) and some flags.)
53. In this basic model, if the CPU periodically
accesses memory addresses that differ by a
multiple of the cache size, they will constantly
evict each other, and most cache accesses
will be misses. This is called cache thrashing.
An application can innocently and very easily
trigger this.
54. To alleviate this problem, each cache block is
turned into an associative memory that can
house more than one cache line.
Each cache block holds more cache lines (2, 4,
8 or more,) and still uses the tag to look up
the line requested by the CPU in the block.
When a new line comes in from memory, an
LRU (or similar) policy is used to evict only the
least-likely-to-be-needed line.
55. References:
Patterson & Hennessy – Computer Organization and Design
Intel 64 and IA-32 Architectures Software Developer’s Manual – vols. 1, 2 and 3
Click & Goetz – A Crash Course in Modern Hardware
Agner Fog – The Microarchitecture of Intel, AMD and VIA CPUs
Drepper – What Every Programmer Should Know About Memory