This document summarizes early evaluation results of the Cray XT3 supercomputer installed at Oak Ridge National Laboratory. It describes the system architecture including the AMD Opteron processors, SeaStar interconnect, and Catamount lightweight operating system. It outlines the evaluation approach and benchmarks used, including micro-benchmarks, kernel benchmarks, and full application performance tests in areas like climate and fusion simulations. Initial results show scaling of applications to 4,096 processors and comparisons to other systems like the Cray X1, XD1, SGI Altix, IBM Blue Gene/L, and Earth Simulator.
This document provides a status update on Oak Ridge National Laboratory's evaluation of the Cray XT3 supercomputer. It describes the XT3 system architecture including its AMD Opteron processors, SeaStar interconnect, and lightweight operating system. It also summarizes performance results from microbenchmarks, kernels, and applications in areas like climate, biology, astrophysics, combustion, and fusion on up to 4,096 processors, demonstrating the XT3's competitive performance, interconnect bandwidth, and parallel efficiency.
This document presents benchmarks to analyze the memory subsystem performance of multicore processors from AMD and Intel. The benchmarks measure latency and bandwidth for different cache coherence states and locations in the memory hierarchy. Testing was done on dual-socket systems using AMD Opteron 2300 (Shanghai) and Intel Xeon 5500 (Nehalem-EP) quad-core processors. Results show significant performance differences driven by each processor's distinct cache architecture and coherence protocol implementations.
OMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELS (IJCNCJournal)
We introduce a family of authenticated data structures, Ordered Merkle Trees (OMT), and illustrate their utility in security kernels for a wide variety of sub-systems. Specifically, two types of OMT, a) the index ordered Merkle tree (IOMT) and b) the range ordered Merkle tree (ROMT), are investigated for their suitability in security kernels for various sub-systems of the Border Gateway Protocol (BGP), the Internet's inter-autonomous-system routing infrastructure. We outline simple generic security kernel functions to maintain OMTs, and sub-system-specific security kernel functionality for BGP sub-systems (like registries, autonomous system owners, and BGP speakers/routers) that takes advantage of OMTs.
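As a rough illustration of the mechanism an OMT builds on, the following C sketch verifies a leaf against a Merkle root using a sibling path; the toy mixing function and node layout are illustrative stand-ins (a real system would use a cryptographic hash such as SHA-256), not the paper's IOMT/ROMT construction.

/* Merkle path verification sketch (illustrative only). */
#include <stdint.h>
#include <stdio.h>

/* Toy 64-bit mixing function standing in for a cryptographic hash. */
static uint64_t hash2(uint64_t a, uint64_t b) {
    uint64_t h = a * 0x9E3779B97F4A7C15ULL ^ b;
    h ^= h >> 31; h *= 0xBF58476D1CE4E5B9ULL; h ^= h >> 27;
    return h;
}

/* Recompute the root from a leaf and its sibling path; bit i of index
 * says whether the leaf's ancestor is a right child at level i. */
uint64_t merkle_root(uint64_t leaf, const uint64_t *sib, int depth,
                     unsigned index) {
    uint64_t h = leaf;
    for (int i = 0; i < depth; i++)
        h = (index >> i) & 1 ? hash2(sib[i], h) : hash2(h, sib[i]);
    return h;
}

int main(void) {
    uint64_t sib[2] = {11, 22};                    /* path of siblings */
    uint64_t root = merkle_root(42, sib, 2, 1u);   /* leaf 42 at index 1 */
    printf("recomputed root = %llx\n", (unsigned long long)root);
    return 0;   /* a verifier compares this root to the trusted one */
}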
This document summarizes a paper that proposes and evaluates the performance of a multithreaded architecture capable of exploiting both coarse-grained parallelism and fine-grained instruction-level parallelism. The architecture distributes processing across multiple processing elements connected by an interconnection network. Each processing element supports multiple concurrently executing threads by grouping instructions from different threads. The architecture introduces a distributed data structure cache to reduce network latency when accessing remote data. Simulation results indicate the architecture achieves high processor throughput and the data structure cache significantly reduces network latency.
The document summarizes the memory performance limitations of Intel Xeon microprocessors based on the Nehalem and Westmere architectures. It finds that per core and per thread memory bandwidth is restricted to about 1/3 of theoretical maximum values. Moving from Nehalem to Westmere, read performance scales well but write performance suffers, revealing scalability issues in Westmere's design. The document aims to provide an accurate analysis of memory bandwidth and latency limitations to help application developers optimize code efficiency.
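A minimal sketch of how such per-thread bandwidth is typically probed, in the spirit of STREAM; the array size, timing method, and GB/s arithmetic are illustrative assumptions, and serious measurements must also control for compiler optimization, NUMA placement, and non-temporal stores.

/* Rough single-thread read/write bandwidth probe; compile with gcc -O2. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (32 * 1024 * 1024)   /* 256 MiB of doubles, well beyond cache */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    memset(a, 1, N * sizeof *a);                 /* touch all pages */
    double t = now(), sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[i];  /* streaming reads */
    double read_s = now() - t;
    t = now();
    for (size_t i = 0; i < N; i++) b[i] = 0.0;   /* streaming writes */
    double write_s = now() - t;
    printf("read  ~%.1f GB/s\n", N * sizeof *a / read_s / 1e9);
    printf("write ~%.1f GB/s (checksum %g)\n",
           N * sizeof *b / write_s / 1e9, sum);  /* sum defeats dead-code elimination */
    return 0;
}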
Exploiting Multi Core Architectures for Process Speed Up (IJERD Editor)
The document discusses exploiting multi-core architectures to speed up processing of satellite data. It describes developing a multithreaded application using a master-slave model to parallelize preprocessing steps like extracting auxiliary data, Reed-Solomon decoding, and decompression. Performance analysis shows the application scales dynamically based on system resources and achieves maximum speedup when assigning 7 threads per core. While theoretical speedup is limited by Amdahl's law, accounting for parallelization overhead provides a more realistic performance measure.
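A small sketch of the speedup model the summary alludes to: Amdahl's law extended with a parallelization-overhead term. The o*n overhead form is an assumption for illustration, not the paper's exact formula.

/* Amdahl's law with a per-thread overhead term. */
#include <stdio.h>

/* f: parallel fraction, n: threads, o: overhead per thread as a
 * fraction of serial runtime. o = 0 recovers plain Amdahl's law. */
static double speedup(double f, int n, double o) {
    return 1.0 / ((1.0 - f) + f / n + o * n);
}

int main(void) {
    for (int n = 1; n <= 64; n *= 2)
        printf("n=%2d  ideal=%.2fx  with overhead=%.2fx\n",
               n, speedup(0.95, n, 0.0), speedup(0.95, n, 0.002));
    return 0;   /* with overhead, speedup peaks and then declines */
}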
MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS (ijcsit)
This research paper compares two multi-core processor machines, the Intel Core i7-4960X processor (Ivy Bridge E) and the AMD Phenom II X6. It starts by introducing a single-core processor machine to motivate the need for multi-core processors. Then, it explains the multi-core processor machine and the issues that arise in implementing them. It also provides real-life example machines such as the TILEPro64 and the Epiphany-IV 64-core 28nm microprocessor (E64G401). The methodology used in comparing the Intel Core i7 and AMD Phenom II processors starts by explaining how processor performance is measured, then lists the technical specifications most important and relevant to the comparison. After that, the comparison is run using different metrics such as power, the use of Hyper-Threading technology, the operating frequency, the use of AES encryption and decryption, and the different characteristics of cache memory such as its size, classification, and memory controller. Finally, it reaches a rough conclusion about which of the two has better overall performance.
An octa core processor with shared memory and message-passing (eSAT Journals)
Abstract: In this era of fast, high-performance computing, efficient optimizations are needed both in the processor architecture and in the memory hierarchy. Advances in communication and multimedia applications are continually pushing up the number of cores in the main processor, viz. dual-core, quad-core, octa-core, and so on. But to enhance the overall performance of a multi-processor chip, inter-core synchronization must also improve. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS III (Microprocessor without Interlocked Pipelined Stages) ISA, handles only integer instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared-memory core are interconnected using a 3x3 2-D mesh topology Network-on-Chip (NoC) with a virtual channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. For verification and functionality testing of the fully synthesized multi-core processor, a matrix multiplication operation is mapped onto it; the multiplications and additions for each element of the resultant matrix are partitioned and scheduled across the eight cores to obtain maximum throughput. All processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
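For flavor, the sketch below partitions the resultant matrix by row blocks across eight workers, mirroring the kind of scheduling the abstract describes; it uses pthreads on a host machine purely for illustration, whereas the actual design runs MIPS cores communicating over a NoC.

/* Row-block partitioning of C = A*B across 8 workers (pthreads sketch). */
#include <pthread.h>
#include <stdio.h>

#define N 8
#define CORES 8
static int A[N][N], B[N][N], C[N][N];

static void *worker(void *arg) {
    int id = (int)(long)arg, rows = N / CORES;  /* N divisible by CORES */
    for (int i = id * rows; i < (id + 1) * rows; i++)
        for (int j = 0; j < N; j++) {
            int s = 0;
            for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = (i == j); }
    pthread_t t[CORES];
    for (long c = 0; c < CORES; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < CORES; c++) pthread_join(t[c], NULL);
    printf("C[3][5] = %d (expect %d, since B is identity)\n", C[3][5], A[3][5]);
    return 0;
}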
This document discusses Intel's Hyper-Threading Technology, which allows a single physical processor core to appear and function as two logical processors to the operating system. It does this by duplicating the core's architectural state and partitioning its execution resources between the two logical processors. This allows both logical processors to execute instructions simultaneously by sharing execution units, caches, and other resources. The document provides details on how the front-end, execution engine, registers, buffers, caches and other components function for both logical processors simultaneously through partitioning, duplication, and alternating access between the two threads.
This document provides an outline and overview of parallel computing platforms and trends in microprocessor architectures. It discusses limitations in memory system performance and dichotomy of parallel platforms. Implicit parallelism through techniques like pipelining and superscalar execution aim to improve processor performance by executing multiple instructions concurrently. However, dependencies and other factors limit efficiency. Alternative approaches like VLIW aim to simplify hardware by moving scheduling to compile time. Memory latency and bandwidth are also significant bottlenecks, as data access times are orders of magnitude slower than processor speeds. Overall parallelism seeks to address performance limitations in processors, memory, and communication.
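Memory latency of the kind discussed here is commonly measured with a dependent pointer chase, as in the sketch below; the array size and the single-cycle (Sattolo) shuffle are illustrative choices meant to defeat caching and prefetching.

/* Pointer-chasing latency probe; compile with gcc -O2. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M entries * 8 B = 128 MiB, larger than cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {   /* Sattolo: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < N; i++) p = next[p];   /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per load (p=%zu)\n", ns / N, p);
    return 0;
}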
This document discusses different types of multiple processor systems including multiprocessors, multicomputers, and distributed systems. It covers topics such as multiprocessor hardware architectures, operating systems, scheduling, communication software, remote procedure calls, distributed shared memory, and middleware for coordination between distributed systems.
This document discusses multiprocessor and multicomputer systems. It defines a multiprocessor system as having more than one processor that shares common memory, while a multicomputer has more than one processor each with local memory. Processors may be closely coupled on a shared bus or loosely coupled distributed on a network. The document also covers Flynn's taxonomy of computer architectures and examples of single instruction single data stream (SISD), single instruction multiple data stream (SIMD), multiple instruction single data stream (MISD), and multiple instruction multiple data stream (MIMD) systems.
This document summarizes a research paper that proposes a new cache coherence protocol called Phase-Priority Based (PPB) cache coherence. PPB aims to optimize directory-based cache coherence protocols for multicore processors. It introduces the concepts of "phase" and "priority" for coherence messages to reduce unnecessary transient states and message stalling. PPB differentiates messages into inner and outer phases based on their place in the coherence transaction ordering. It also prioritizes messages in the on-chip network to improve efficiency. Analysis shows PPB outperforms traditional MESI, reducing transient states and stalls by up to 24% with a 7.4% speedup.
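For context, a minimal baseline MESI next-state function is sketched below; it is deliberately simplified (for instance, a read miss is filled as SHARED even when EXCLUSIVE would be granted) and does not model PPB's transient states, phases, or priorities.

/* Tiny slice of a baseline MESI state machine for one cache line. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

static mesi_t mesi_next(mesi_t s, event_t e) {
    switch (e) {
    case LOCAL_READ:   return s == INVALID ? SHARED : s;   /* fill on miss */
    case LOCAL_WRITE:  return MODIFIED;                    /* gain ownership */
    case REMOTE_READ:  return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case REMOTE_WRITE: return INVALID;                     /* invalidate */
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = mesi_next(s, LOCAL_READ);   /* INVALID  -> SHARED   */
    s = mesi_next(s, LOCAL_WRITE);  /* SHARED   -> MODIFIED */
    s = mesi_next(s, REMOTE_READ);  /* MODIFIED -> SHARED   */
    printf("final state = %d (SHARED)\n", (int)s);
    return 0;
}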
This document describes PowerAlluxio, an in-memory file system that improves on Alluxio by enabling shared memory utilization across cluster nodes while maintaining memory locality. PowerAlluxio allows client nodes to utilize remote node memory if local memory is full, improving cluster memory usage without sacrificing performance. It also introduces a new Smart LRU eviction policy that reduces elapsed time by 24.76% for large datasets. Experiments showed PowerAlluxio achieved up to 14.11x faster task completion times compared to Alluxio when data could be fully cached.
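For concreteness, a plain LRU eviction list is sketched below; the summary's Smart LRU policy layers additional heuristics that are not modeled here.

/* Minimal LRU over cached block IDs; lru[0] is the most recently used. */
#include <stdio.h>
#include <string.h>

#define CAP 3
static int lru[CAP], used = 0;

static void access_block(int id) {
    int i;
    for (i = 0; i < used && lru[i] != id; i++) ;   /* find position */
    if (i == used) {                               /* miss */
        if (used == CAP) { printf("evict %d\n", lru[CAP - 1]); used--; }
        i = used++;
    }
    memmove(&lru[1], &lru[0], i * sizeof lru[0]);  /* shift others down */
    lru[0] = id;                                   /* promote to front */
}

int main(void) {
    int trace[] = {1, 2, 3, 1, 4, 2};              /* 4 evicts 2; 2 evicts 3 */
    for (int k = 0; k < 6; k++) access_block(trace[k]);
    return 0;
}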
This document discusses main memory management techniques in operating systems. It covers topics like address binding, swapping, contiguous memory allocation, segmentation, paging, and structure of page tables. The key points are:
1) Main memory management aims to efficiently allocate memory for processes while allowing logical addresses to differ from physical addresses. This is achieved through techniques like segmentation and paging.
2) Swapping allows processes to be temporarily moved from main memory to disk, allowing the total memory usage of processes to exceed physical memory. However, swapping is slow compared to other techniques.
3) Contiguous memory allocation allocates each process to a contiguous block of memory. This can lead to internal and external fragmentation over time.
This document provides information about the Operating Systems course code 17CS64 at Canara Engineering College, Mangaluru. The module covers secondary storage structures including mass storage, disk structures, disk attachments, disk scheduling, disk management, and swap space management. It also covers protection goals, principles, domains, access control matrices, and capability-based systems. Finally, it discusses Linux operating system history, design, kernel modules, processes, scheduling, memory management, file systems, input/output, and inter-process communication. Web resources are provided for additional information on operating systems topics.
The Pentium Pro processor is an improvement over previous Intel processors with additional features such as a 64-bit data bus, 8KB instruction and data caches, two parallel integer execution units, and a floating point unit. It also includes an on-chip 256/512KB L2 cache, branch prediction logic, and error detection capabilities. The Core 2 Duo processor is the first Intel processor to use a dual-core design, allowing two independent processor cores to operate simultaneously and share an L2 cache and front side bus. It features technologies such as virtualization, trusted execution, and an execute disable bit for enhanced security.
This document presents an analytical model to evaluate the performance of multithreaded multiprocessors with distributed shared memory. The model uses a multi-chain closed queuing network to model the processor, memory, and network subsystems in an integrated manner. This captures the strong coupling between subsystems. The model shows that high performance is achieved when the memory request rate matches the weighted sum of memory bandwidth and average remote memory access distance. The model is validated using a stochastic timed Petri net model.
The document discusses several topics related to operating system I/O subsystems:
- It describes kernel I/O services like buffering, caching, spooling, scheduling I/O requests, and handling errors.
- It explains disk scheduling algorithms like SSTF, SCAN, C-SCAN, and C-LOOK that aim to minimize disk head movement (a SCAN ordering sketch follows this list).
- It covers disk management topics such as disk formatting, initialization, and recovering bad blocks.
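As referenced above, here is a minimal SCAN ("elevator") ordering of a request batch; the request list and head position are made-up textbook-style values.

/* SCAN: service requests upward from the head, then reverse. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) { return *(int *)a - *(int *)b; }

int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67}, n = 8, head = 53;
    qsort(req, n, sizeof req[0], cmp);
    int start = 0;
    while (start < n && req[start] < head) start++;  /* first cylinder >= head */
    for (int i = start; i < n; i++) printf("%d ", req[i]);       /* sweep up */
    for (int i = start - 1; i >= 0; i--) printf("%d ", req[i]);  /* sweep down */
    printf("\n");   /* prints: 65 67 98 122 124 183 37 14 */
    return 0;
}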
Intel's Nehalem Microarchitecture by Glenn Hinton (parallellabs)
Intel's Nehalem family of CPUs spans from large multi-socket 32 core/64 thread systems to ultra small form factor laptops. What were some of the key tradeoffs in architecting and developing the Nehalem family of CPUs? What pipeline should it use? Should it optimize for servers? For desktops? For laptops? There are lots of tradeoffs here. This talk will discuss some of the tradeoffs and results.
This document provides an overview of the Operating Systems course code 17CS64 at Canara Engineering College. The course is taught in the 6th semester and has 10 hours of content. Module 1 covers introductions to operating systems, including what they do, computer system organization and architecture, OS structure and operations, processes, memory management, storage management, security, distributed systems, and computing environments. It also discusses OS structures, services, interfaces, system calls, programs, design, and boot processes. Process management concepts like processes, scheduling, and inter-process communication are also introduced. Web resources for further reading on operating systems are provided.
Microarchitecture refers to how an instruction set architecture is implemented in a processor. It focuses on aspects like chip area, power consumption, and complexity. Nehalem was Intel's latest microarchitecture at the time, featuring an integrated memory controller, QuickPath interconnect, and improvements in performance and power efficiency over previous architectures. Its successors included Westmere, Sandy Bridge, and Haswell.
This document contains the answers to several questions about memory management techniques. It compares internal and external fragmentation, discusses how a linkage editor changes binding of instructions and data, and analyzes how first-fit, best-fit, and worst-fit placing algorithms handle sample processes. It also examines the requirements for dynamic memory allocation in different schemes and compares schemes in terms of issues like fragmentation and code sharing.
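A first-fit placement sketch over an assumed list of free holes, for illustration (best-fit would scan for the smallest adequate hole, worst-fit for the largest):

/* First-fit: place each request in the first hole large enough. */
#include <stdio.h>

#define HOLES 5
static int hole[HOLES] = {100, 500, 200, 300, 600};   /* free sizes (KB) */

static int first_fit(int request) {
    for (int i = 0; i < HOLES; i++)
        if (hole[i] >= request) { hole[i] -= request; return i; }
    return -1;                                        /* no hole fits */
}

int main(void) {
    int procs[] = {212, 417, 112, 426};
    for (int p = 0; p < 4; p++)
        printf("process %d KB -> hole %d\n", procs[p], first_fit(procs[p]));
    return 0;   /* the 426 KB request gets -1: it must wait */
}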
This document discusses the key components of a file system and disk drive architecture. It covers disk structure, scheduling, management, and swap space. It describes concepts like disk formatting, boot blocks, bad blocks, and swap space location. The document also provides a high-level overview of the Windows 2000 operating system, covering its architecture, kernel, processes and threads, exceptions, and subsystems.
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737... (Léia de Sousa)
The document explains how cache memory works to improve the performance of processors, which keep getting faster than RAM. Cache memory temporarily stores data frequently used by the processor to reduce RAM access time. Modern processors use several levels of cache, with the L1 cache being small and fast and the L2 cache larger and slower.
1. The document discusses geometric algorithms used in geographic information systems (GIS). 2. It presents a review of basic computational geometry concepts applied to GIS, including definitions of points, lines, polygons, and vector classes. 3. It also defines abstract data types to represent geometric entities in GIS.
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736... (Léia de Sousa)
The document discusses optimizing the data cache performance of a software MPEG-2 video decoder. It analyzes the memory and cache behavior of software MPEG-2 decoding. Various cache-oriented architectural enhancements are proposed that could reduce excess cache-memory traffic by 50% or more, such as selectively caching specific data types based on their sizes, access types, and access patterns. Trace-driven cache simulations were performed on a software MPEG-2 decoder to analyze the effects of cache parameters and evaluate the proposed enhancements.
The document discusses the problem of cache memory coherence in multiprocessor systems and ways to solve it. The cache coherence problem is presented along with how it can be solved through software-based or hardware-based schemes. Common protocols such as snoopy and directory are explained.
This document discusses models for predicting the future of multi-core processors and whether they can continue to provide performance scaling according to Moore's Law. It presents the Device Scaling Model (DevM) to predict characteristics of future process nodes, the Single Core Scaling Model (CorM) to model area/performance and power/performance, and the Multi-Core Scaling Models (CmpM), including an upper-bound Amdahl's Law model (CmpMU) and a more realistic model (CmpMR). Combining these models shows that a 32x performance improvement from 2008 to 2024 is not possible due to issues like increasing "dark silicon" power inefficiencies at future nodes. Even the most optimistic ITRS projection yields only about a 7.9x average speedup across common parallel workloads.
Dark silicon and the end of multicore scaling (Léia de Sousa)
This document models limits on multicore scaling over the next 5 technology generations from 45nm to 8nm. It combines models of device scaling, single-core performance scaling, and multicore performance scaling. The key findings are:
- Using optimistic ITRS projections, only 7.9x average speedup is possible across common parallel workloads, far short of the target 32x speedup from 5 generations of doubling cores.
- Power limitations mean an increasing fraction of chip area must be powered off ("dark silicon") - 21% at 22nm and over 50% at 8nm.
- Neither CPU-like nor GPU-like multicore designs can achieve expected performance levels; radical innovations are needed (an Amdahl-style upper bound is sketched after this list).
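For reference, the upper-bound side of such models is essentially Amdahl's law evaluated as core counts double each generation, as in the sketch below; the parallel fractions shown are illustrative, not the paper's workload data.

/* Amdahl upper bound on multicore speedup, ignoring power/area limits. */
#include <stdio.h>

static double amdahl(double f, double n) { return 1.0 / ((1.0 - f) + f / n); }

int main(void) {
    /* five generations of core doubling from a 4-core baseline */
    for (double n = 4; n <= 128; n *= 2)
        printf("cores=%3.0f  f=0.99 -> %6.2fx   f=0.90 -> %5.2fx\n",
               n, amdahl(0.99, n), amdahl(0.90, n));
    return 0;   /* real dark-silicon-limited designs fall below this */
}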
Dark Silicon, Mobile Devices, and Possible Open-Source Solutions (Koan-Sin Tan)
This document summarizes a presentation about dark silicon in mobile devices and possible open source solutions. It discusses how power and thermal constraints are more severe for mobile devices due to limited battery progress and no fans. It also covers big.LITTLE scheduling, thread-level parallelism challenges, and user-level threading libraries like AsyncTask. Finally, it notes that while some open source parallel programming frameworks exist, fully utilizing parallelism on mobile and addressing dark silicon remain challenges with no widely adopted solutions.
Growth and development of cranium and face (Rajesh Bariker)
The document discusses prenatal human growth and development, beginning with definitions of growth and development and covering topics such as critical periods, signaling growth factors, prenatal development including pre-implantation, embryonic, and fetal periods, postnatal development, osteogenesis, basic growth movements, theories of growth, and normal and abnormal development. It provides details on the derivation and development of structures from the germ layers and pharyngeal arches during important periods such as pre-somite, somite, and post-somite.
Analysis of Multicore Performance Degradation of Scientific Applications (James McGalliard)
This document analyzes performance degradation of scientific applications on multi-core systems. It summarizes benchmark results from dual-core and quad-core systems running Department of Defense applications. The tests show a runtime performance penalty for multi-core runs compared to single-core runs on the same systems. This is attributed to contention for memory bandwidth between cores, as shown by synthetic memory benchmark results, consistent with reported studies. The document outlines the benchmark systems used, prior study findings on multi-core performance issues, and analysis of application and memory benchmark results on the multi-core systems.
Sun disclosed plans for its next-generation processor called UltraSparc-3, which will operate at 600MHz - twice as fast as its current fastest chips. UltraSparc-3 aims to significantly improve performance over existing chips through higher clock speeds combined with other enhancements like improved branch prediction and larger caches. It features an unusual system interface designed to deliver performance scaling linearly with the number of processors used.
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED... (Ilango Jeyasubramanian)
This document proposes and evaluates several cache designs for improving performance and energy efficiency in multi-core processors. It introduces a filter cache shared among cores to reduce energy consumption. It then implements a segmented least recently used replacement policy and adaptive bypassing to further improve cache hit rates. Finally, it modifies the MOESI coherence protocol for a ring interconnect topology to address data coherence across cores. Simulations show the proposed cache designs reduce energy usage by 11% and increase cache hit rates by up to 7% compared to baseline designs.
This document evaluates the performance of four memory consistency models (sequential consistency, processor consistency, weak consistency, and release consistency) for shared-memory multiprocessors using simulation studies of three applications (MP3D, LU, and PTHOR). The results show that sequential consistency performs significantly worse than the other models. Surprisingly, processor consistency performs almost as well as release consistency and better than weak consistency for one application, indicating that allowing reads to bypass pending writes provides more benefit than allowing writes to pipeline.
This document analyzes the performance of two quad-core processors, the AMD Barcelona and Intel Xeon X7350, on scientific applications. It finds that while the Intel processor has a higher clock rate, the AMD processor has higher memory bandwidth and intra-node scalability. A suite of scientific applications were tested on each processor, showing a range of speedups from 3x to 16x over a single core. The document examines low-level benchmarks and application scaling to determine which processor configuration delivers the best performance for different workloads.
This document provides an overview of supercomputers including their definition, the top supercomputers in the world by processing speed, India's fastest supercomputer SahasraT, proposed methodology for a prototype supercomputer using a 1-4 node cluster, common network topologies like fat tree and torus, how performance would be analyzed using benchmarking software, basic components of the prototype model, and potential applications of supercomputers in scientific research, data management, and multitasking.
International Journal of Engineering Research and Development (IJERD Editor)
This document summarizes the design of a virtual extended memory symmetric multiprocessor (SMP) organization using LC-3 processors. It discusses the LC-3 processor architecture and instruction set. It then describes the design of a dual core LC-3 processor that shares memory over 32K bank sizes. The key components of the LC-3 processor pipeline including fetch, decode, execute, and writeback units are defined along with their inputs, outputs, and functions. Memory architectures for SMP systems including conventional, direct connect, and shared bus approaches are also summarized.
The document discusses parallel computing platforms and trends in microprocessor architectures that enable implicit parallelism. It covers topics like pipelining, superscalar execution, limitations of memory performance, and how caches can improve effective memory latency. The key points are:
1) Microprocessor clock speeds have increased dramatically but limitations remain regarding memory latency and bandwidth. Parallelism addresses performance bottlenecks in processors, memory, and communication.
2) Techniques like pipelining and superscalar execution exploit implicit parallelism by executing multiple instructions concurrently, but dependencies and branch prediction limit performance gains.
3) Memory latency is often the bottleneck, but caches can reduce effective latency through data reuse and temporal locality.
A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTTLENECK (ijcsit)
The growing rate of technology improvements has caused dramatic advances in processor performance, significantly speeding up processor operating frequencies and increasing the number of instructions that can be processed in parallel. This development has brought performance improvements in computer systems, but not for all types of applications. The reason lies in the well-known von Neumann bottleneck, which occurs in the communication between the processor and the main memory in a standard processor-centric system. This problem has been reviewed by many scientists, who have proposed different approaches for improving memory bandwidth and latency. This paper provides a brief review of these techniques and also gives a deep analysis of various memory-centric systems that implement different approaches to merging or placing the memory near the processing elements. Within this analysis we discuss the advantages, disadvantages, and the application (purpose) of several well-known memory-centric systems.
The document analyzes the performance of the LEON 3FT processor at different operating frequencies. A hardware implementation using the LEON 3FT processor was tested by executing benchmark programs at various frequencies. The results show that execution time decreases with higher operating frequencies, though there is a maximum frequency limit due to hardware constraints. Future work involves attempting to increase this maximum frequency limit while maintaining processor performance.
The document discusses several types of processors including Pentium 4, dual-core, and quad-core processors, explaining their features and advantages. Pentium 4 used the NetBurst architecture but faced challenges scaling to higher speeds. Dual-core and quad-core processors place multiple processor cores on a single chip to improve performance through parallel processing while reducing power needs.
Cache memory is a fast memory located between the CPU and main memory that stores frequently accessed instructions and data. It improves system performance by reducing memory access time. Cache is organized into multiple levels - L1 cache is closest to the CPU, L2 cache is next, and some CPUs have an L3 cache. (Level 1, 2, 3 caches refer to their proximity to the CPU.) Cache memory uses SRAM instead of DRAM for faster access. It is organized into rows containing a data block, tag, and flag bits. Optimization techniques for cache include improving data locality through code transformations and maintaining coherence across cache levels.
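The data-locality point is the classic row-order versus column-order traversal, sketched below: the row-order loop walks memory sequentially and reuses every fetched cache line, while the column-order loop strides a full row between accesses.

/* Same sum, two traversal orders; row order is cache-friendly for the
 * row-major arrays used by C. */
#include <stdio.h>

#define N 1024
static double m[N][N];

static double sum_row_order(void) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) s += m[i][j];   /* unit stride */
    return s;
}

static double sum_col_order(void) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) s += m[i][j];   /* stride of N doubles */
    return s;
}

int main(void) {
    m[7][7] = 1.0;
    printf("%g %g\n", sum_row_order(), sum_col_order());  /* both print 1 */
    return 0;
}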
The Pentium III was a desktop and mobile CPU produced by Intel between 1999-2003. It had clock speeds between 400 MHz and 1.4 GHz and included features like MMX and SSE instructions. There were several steppings of the Pentium III, including Katmai at 0.25 μm, Coppermine at 0.18 μm, Coppermine T at 0.18 μm, and Tualatin at 0.13 μm. Each stepping improved performance through higher clock speeds, larger caches, and support for newer instruction sets. Optimizing code for the Pentium III microarchitecture required techniques like scheduling instructions to maximize decoder throughput, balancing usage of execution units, and minimizing register dependencies. The Pentium III was also notable for including a unique processor serial number.
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Network Processor (CSCJournals)
An ideal Network Processor, that is, a programmable multi-processor device, must be capable of offering both the flexibility and the speed required for packet processing. But current Network Processor systems generally fall short of the above benchmarks due to traffic fluctuations inherent in packet networks, and the resulting workload variation on individual pipeline stages over time ultimately affects the overall performance of an otherwise sound system. One potential solution is to change the code running at these stages so as to adapt to the fluctuations; a near-robust system withstanding traffic fluctuations is the dynamically adaptive processor, reconfiguring the entire system, which we introduce and study to some extent in this paper. We achieve this by using a crucial decision-making model, transferring the binary code to the processor through the SOAP protocol.
CXL is enabling new memory architectures by connecting CPUs and GPUs to shared memory pools. Early CXL 1.1 focused on memory expansion by connecting processors to DRAM modules. CXL 2.0 allowed for small memory pools accessible by a few servers. CXL 3.0 supports larger shared memory fabrics by connecting thousands of nodes and enabling true shared memory regions accessible coherently by multiple hosts and accelerators. However, shared memory fabrics using CXL 3.0 may experience greater latency variability and congestion compared to single-host or small memory pooling configurations.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
HCL Notes and Domino license cost reduction in the world of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefit it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, e.g., when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder bring this new world closer to you. It gives you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
The following topics are covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement immediately
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
HCL Notes and Domino License Cost Reduction in the World of DLAU
Cray xt3
Published in Proc. 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2006.
Early Evaluation of the Cray XT3
Jeffrey S. Vetter Sadaf R. Alam Thomas H. Dunigan, Jr.
Mark R. Fahey Philip C. Roth Patrick H. Worley
Oak Ridge National Laboratory
Oak Ridge, TN, USA 37831
{vetter,alamsr,duniganth,faheymr,rothpc,worleyph}@ornl.gov
Abstract
Oak Ridge National Laboratory recently received
delivery of a 5,294 processor Cray XT3. The XT3 is
Cray’s third-generation massively parallel processing
system. The system builds on a single processor node—
built around the AMD Opteron—and uses a custom chip—
called SeaStar—to provide interprocessor communication.
In addition, the system uses a lightweight operating system
on the compute nodes. This paper describes our initial
experiences with the system, including micro-benchmark,
kernel, and application benchmark results. In particular,
we provide performance results for strategic Department
of Energy applications areas including climate and fusion.
We demonstrate experiments on the installed system,
scaling applications up to 4,096 processors.
1 Introduction
Computational requirements for many large-scale
simulations and ensemble studies of vital interest to the
Department of Energy (DOE) exceed what is currently
offered by any U.S. computer vendor. As illustrated in the
DOE Scales report [32] and the High End Computing
Revitalization Task Force report [18], examples are
numerous, ranging from global climate change research to
combustion to biology.
Performance of the current class of HPC architectures
is dependent on the performance of the memory hierarchy,
ranging from the processor-to-cache latency and
bandwidth to the latency and bandwidth of the
interconnect between nodes in a cluster, to the latency and
bandwidth in accesses to the file system. With increasing
chip clock rates and number of functional units per
processor and the lack of corresponding improvements in
memory access latencies, this dependency will only
increase. Single processor performance, or the
performance of a small system, is relatively simple to
determine. However, given reasonable sequential
performance, the metric of interest in evaluating the ability
of a system to achieve multi-Teraop performance is
scalability. Here, scalability includes the performance
sensitivity to variation in both problem size and the
number of processors or other computational resources
utilized by a particular application.
ORNL has been evaluating these critical factors on
several platforms that include the Cray X1 [1], the SGI
Altix 3700 [13], and the Cray XD1 [15]. This report
describes initial evaluation results collected on an early
version of the Cray XT3 sited at ORNL. Recent results are
also publicly available from the ORNL evaluation web site
[25]. We have been working closely with Cray, Sandia
National Laboratory, and Pittsburgh Supercomputing
Center to install and evaluate our XT3.
2 Cray XT3 System Overview
The XT3 is Cray’s third-generation massively parallel
processing system. It follows a similar design to the
successful Cray T3D and Cray T3E [29] systems. As in
these previous systems, the XT3 builds upon a single
processor node, or processing element (PE). However,
unlike the T3D and T3E, the XT3 uses a commodity
microprocessor—the AMD Opteron—at its core. The XT3
connects these processors with a customized interconnect
managed by a Cray-designed Application-Specific
Integrated Circuit (ASIC) called SeaStar.
2.1 Processing Elements
As Figure 1 shows, each PE has one Opteron processor
with its own dedicated memory and communication
resource. The XT3 has two types of PEs: compute PEs and
service PEs. The compute PEs are optimized for
application performance and run a lightweight operating
system kernel called Catamount. In contrast, the service
PEs run SuSE Linux and are configured for I/O, login,
network, or system functions.
The XT3 uses a blade approach for achieving high
processor density per system cabinet. On the XT3, a
compute blade holds four compute PEs (or nodes), and
eight blades are contained in a chassis. Each XT3 cabinet
holds three chassis, for a total of 96 compute processors
per cabinet. In contrast, service blades host two service
PEs and provide PCI-X slots.
The ORNL XT3 uses Opteron model 150 processors.
As Figure 2 shows, this model includes an Opteron core,
integrated memory controller, three 16b-wide 800 MHz
HyperTransport (HT) links, and L1 and L2 caches. The
Opteron core has three integer units and one floating point
unit capable of two floating-point operations per cycle [3].
Because the processor core is clocked at 2.4 GHz, the peak
floating point rate of each compute node is 4.8 GFlops.
Figure 1: Cray XT3 Architecture (Image courtesy of Cray).
The memory structure of the Opteron consists of a
64KB 2-way associative L1 data cache, a 64KB 2-way
associative L1 instruction cache, and a 1MB 16-way
associative, unified L2 cache. The Opteron has sixteen 64b
integer registers, eight 64b floating point registers, sixteen
128b SSE/SSE2 registers, and uses 48b virtual addresses
and 40b physical addresses. The memory controller data
width is 128b. Each PE has 2 GB of memory but only 1
GB is usable with the kernel used for our evaluation. The
memory DIMMs are 1 GB PC3200, Registered ECC, 18 x
512 mbit parts that support Chipkill. Since PC3200
(DDR-400) memory provides a peak of 3.2 GB/s per 64b
channel, the 128b-wide controller yields a peak memory
bandwidth of 6.4 GB/s per processor.
In general, processors that support SMP configurations
have larger memory access latencies than those that do not
support SMP configurations, due to the additional circuitry
for coordinating memory accesses and for managing
memory coherence across SMP processors. Furthermore,
on-chip memory controllers enable smaller access
latencies than off-chip memory controllers (called
“Northbridge” chips in Intel chipsets). The Opteron 150
used in our XT3 does not support SMP configurations
because none of its HyperTransport links support the
coherent HyperTransport protocol. Also, the Opteron 150
has an on-chip memory controller. As a result, memory
access latencies with the Opteron 150 are in the 50-60 ns
range. These observations are quantified in Section 4.1.
The Opteron’s processor core has a floating-point
execution unit (FPU) that handles all register operations
for x87 instructions, 3DNow! operations, all MMX
operations, and all SSE and SSE2 operations. This FPU
contains a scheduler, a register file, a stack renaming unit,
a register renaming unit, and three parallel execution units.
One execution unit is known as the adder pipe (FADD); it
contains a MMX ALU/shifter and floating-point adder.
The next execution unit is known as the multiplier
(FMUL); it provides the floating-point
multiply/divide/square root operations and also an MMX
ALU. The final unit supplies floating-point load/store
(FSTORE) operations.
Figure 2: AMD Opteron Design (Image courtesy of AMD).
2.2 Interconnect
Each Opteron processor is directly connected to the
XT3 interconnect via a Cray SeaStar chip (see Figure 1).
This SeaStar chip is a routing and communications chip
and it acts as the gateway to the XT3’s high-bandwidth,
low-latency interconnect. The PE is connected to the
SeaStar chip with a 6.4 GB/s HT path. SeaStar provides
six high-speed network links to connect to neighbors in a
3D torus/mesh topology. Each of the six links has a peak
bandwidth of 7.6 GB/s with sustained bandwidth of
around 4 GB/s. In the XT3, the interconnect carries all
message passing traffic as well as I/O traffic to the
system’s Lustre parallel file system.
Each SeaStar ASIC contains:
• HyperTransport link [4]—this enables the SeaStar
chip to communicate with the Opteron processor over
parallel links without bus arbitration overheads.
• PowerPC 440 processor—this communications and
management processor cooperates with the Opteron to
synchronize and to schedule communication tasks.
• Direct Memory Access (DMA) engine—the DMA
engine and the PowerPC processor work together to
off-load message preparation and demultiplexing tasks
from the Opteron processor.
• router—the router provides six network links to the six
neighboring processors in a 3D torus topology.
• service port—this port bridges between the separate
management network and the Cray SeaStar local bus.
The service port allows the management system to
access all registers and memory in the system and
facilitates booting, maintenance and system
monitoring. Furthermore, this interface can be used to
reconfigure the router in the event of failures.
The ORNL Cray XT3 has 56 cabinets holding 5,212
compute processors and 82 service processors. Its nodes
are connected in a three-dimensional mesh of size 14 x 16
x 24, with torus links in the first and third dimension.
2.3 Software
The Cray XT3 inherits several aspects of its systems
software approach from a sequence of systems developed
and deployed at Sandia National Laboratories: ASCI Red
[23], the Cplant [7, 27], and Red Storm [6]. The XT3 uses
a lightweight kernel operating system on its compute PEs,
a user-space communications library, and a hierarchical
approach for scalable application start-up.
The XT3 uses two different operating systems:
Catamount on compute PEs and Linux on service PEs.
Catamount is the latest in a sequence of lightweight kernel
operating systems developed at Sandia and the University
of New Mexico, including SUNMOS [21], Puma [33], and
Cougar. (Cougar is the product name for the port of Puma
to the Intel ASCI Red system.) For scalability and
performance predictability, each instance of the
Catamount kernel runs only one single-threaded process
and does not provide services like demand-paged virtual
memory that could cause unpredictable performance
behavior. Unlike the compute PEs, service PEs (i.e., login,
I/O, network, and system PEs) run a full SuSE Linux
distribution to provide a familiar and powerful
environment for application development and for hosting
system and performance tools.
The XT3 uses the Portals [8] data movement layer for
flexible, low-overhead inter-node communication. Portals
provides connectionless, reliable, in-order delivery of
messages between processes. For high performance and to
avoid unpredictable changes in the kernel’s memory
footprint, Portals delivers data from a sending process’ user
space to the receiving process’ user space without kernel
buffering. Portals supports both one-sided and two-sided
communication models, as well as multiple higher-level
communication protocols, including protocols for
MPI message passing between application processes and
for transferring data to and from I/O service PEs.
The Cray XT3 programming environment includes
compilers, communication libraries, and correctness and
performance tools [11]. The Portland Group’s C, C++,
and Fortran compilers are available. Cray-provided
compiler wrappers ease the development of parallel
applications for the XT3 by automatically including
compiler and linker switches needed to use the XT3’s
communication libraries. The primary XT3
communication libraries provide the standard MPI-2
message passing interface and Cray’s SHMEM interface.
Low level communication can be performed using the
Portals API. The Etnus TotalView debugger is available
for the XT3, and Cray provides the Apprentice2 tool for
performance analysis.
The primary math library is the AMD Core Math
Library (ACML). It incorporates BLAS, LAPACK and
FFT routines, and is optimized for high performance on
AMD platforms. This library is available both as a 32-bit
library for compatibility with legacy x86 applications, and
as a 64-bit library that is designed to fully exploit the large
memory space and improved performance offered by the
new AMD64 architecture.
3 Evaluation Overview
As a function of the Early Evaluation project at ORNL,
numerous systems have been rigorously evaluated using
important DOE applications. Recent evaluations have
included the Cray X1 [12], the SGI Altix 3700 [13], and
the Cray XD1 [15].
The primary goals of these evaluations are to
1) determine the most effective approaches for using
each system, 2) evaluate benchmark and application
performance, both in absolute terms and in comparison
with other systems, and 3) predict scalability, both in
terms of problem size and in number of processors. We
employ a hierarchical, staged, and open approach to the
evaluation, examining low-level functionality of the
system first, and then using these results to guide and
understand the evaluation using kernels, compact
applications, and full application codes. The low-level
benchmarks (for example, message passing) and the kernel
benchmarks are chosen to model important features of
full applications. This
approach is also important because several of the
platforms contain novel architectural features that make it
difficult to predict the most efficient coding styles and
programming paradigms. Performance activities are staged
to produce relevant results throughout the duration of the
system installation. For example, subsystem performance
will need to be measured as soon as a system arrives, and
measured again following a significant upgrade or system
expansion.
For comparison, performance data is also presented for
the following systems:
• Cray X1 at ORNL: 512 Multistreaming processors
(MSP), each capable of 12.8 GFlops/sec for 64-bit
operations. Each MSP is comprised of four single
streaming processors (SSPs). The SSP uses two clock
frequencies, 800 MHz for the vector units and 400
MHz for the scalar unit. Each SSP is capable of 3.2
GFlops/sec for 64-bit operations.
• Cray X1E at ORNL: 1024 Multistreaming processors
(MSP), each capable of 18 GFlops/sec for 64-bit
operations. Each MSP is comprised of four single
streaming processors (SSPs). The SSP uses two clock
frequencies, 1130 MHz for the vector units and 565
MHz for the scalar unit. Each SSP is capable of 4.5
GFlops/sec for 64-bit operations. This system is an
upgrade of the original Cray X1 at ORNL.
• Cray XD1 at ORNL: 144 AMD 2.2GHz Opteron 248
processors, configured as 72 2-way SMPs with 4GB of
memory per processor. The processors are
interconnected by Cray’s proprietary RapidArray
interconnect fabric.
• Earth Simulator: 640 8-way vector SMP nodes and a
640x640 single-stage crossbar interconnect. Each
processor has 8 64-bit floating point vector units
running at 500 MHz.
• SGI Altix at ORNL: 256 Itanium2 processors and a
NUMAlink switch. The processors are 1.5 GHz
Itanium2. The machine has an aggregate of 2 TB of
shared memory.
• SGI Altix at NASA: Twenty Altix 3700 Bx2 nodes,
where each node contains 512 Itanium2 processors
running at 1.6 GHz with SGI's NUMAflex
interconnect. We used two such nodes, connected by a
NUMAlink4 switch.
• IBM p690 cluster at ORNL: 27 32-way p690 SMP
nodes and an HPS interconnect. Each node has two
HPS adapters, each with two ports. The processors are
the 1.3 GHz POWER4.
• IBM SP at the National Energy Research Scientific
Computing Center (NERSC): 184 Nighthawk (NH)
II 16-way SMP nodes and an SP Switch2. Each node
has two interconnect interfaces. The processors are the
375MHz POWER3-II.
• IBM Blue Gene/L at ANL: a 1024-node Blue Gene/L
system at Argonne National Laboratory. Each Blue
Gene/L processing node consists of an ASIC with two
PowerPC processor cores, on-chip memory and
communication logic. Each processor core provides a
peak of 2.8 GFlops, for a total of 5.6 GFlops per
processing node.
• IBM POWER5 at ORNL: An IBM 9124-720 system
with two dual-core 1.65 GHz POWER5 processors and
16 GB of memory, running Linux.
4 Micro-benchmarks
The objective of micro-benchmarking is to characterize
the performance of the specific architectural components
of the platform. We use both standard benchmarks and
customized benchmarks. The standard benchmarks allow
consistent historical comparisons across platforms. The
custom benchmarks permit the unique architectural
features of the system (e.g., global address space memory)
to be tested with respect to the target applications.
Traditionally, our micro-benchmarking focuses on the
arithmetic performance, memory-hierarchy performance,
task and thread performance, message-passing
performance, system and I/O performance, and parallel
I/O. However, because the XT3 has single-processor
nodes and uses a lightweight operating system, we focus
only on the following areas:
• Arithmetic performance, including varying instruction
mix, identifying what limits computational
performance.
• Memory-hierarchy performance, including levels of
cache and shared memory.
• Message-passing performance, including intra-node,
inter-node, and inter-OS image MPI performance for
one-way (ping-pong) messages, message exchanges,
and collective operations (broadcast, all-to-all,
reductions, barriers); message-passing hotspots and the
effect of message passing on the memory subsystem
are studied.
Table 1: STREAM Triad Performance.
Processor       Triad Bandwidth (GB/s)
Cray XT3        5.1
Cray XD1        4.1
Cray X1 MSP     23.8
IBM p690        2.1
IBM POWER5      4.0
SGI Altix       3.8
Current, detailed micro-benchmark data for all existing
evaluations is available at our Early Evaluation
website [25].
4.1 Memory Performance
The memory performance of current architectures is a
primary factor for performance on scientific applications.
Table 1 illustrates the differences in measured memory
bandwidth on the triad STREAM benchmark. The very
high bandwidth of the Cray X1 MSP clearly dominates the
other processors, but the Cray XT3’s Opteron has the
highest bandwidth of the other microprocessor-based
systems.
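For reference, the triad kernel that STREAM times is essentially the loop below. This is a minimal sketch rather than the official benchmark: the array size, repetition, and timing harness are illustrative choices, and STREAM additionally validates results and reports the best of several trials.

/* Minimal sketch of the STREAM triad kernel: a(i) = b(i) + q*c(i).
 * Illustrative only; the official benchmark adds repetition,
 * validation, and careful timing. Compile with optimization on. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (8 * 1024 * 1024)   /* ~64 MB per array, far larger than L2 */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    double t0 = now();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];          /* triad: 2 reads + 1 write */
    double t = now() - t0;

    /* 24 bytes move per iteration (read b, read c, write a). */
    printf("triad bandwidth: %.2f GB/s (a[0]=%g)\n",
           24.0 * N / t / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}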
Table 2: Latency to Main Memory.
Platform                            Measured Latency to Main Memory (ns)
Cray XT3 / Opteron 150 / 2.4 GHz    51.41
Cray XD1 / Opteron 248 / 2.2 GHz    86.51
IBM p690 / POWER4 / 1.3 GHz         90.57
Intel Xeon / 3.0 GHz                140.57
As discussed earlier, the choice of the Opteron model
150 was motivated in part to provide low access latency to
main memory. As Table 2 shows, our measurements
revealed that the Opteron 150 has lower latency than the
Opteron 248 configured as a 2-way SMP in the XD1.
Furthermore, it has considerably smaller latency than
either the POWER4 or the Intel Xeon, which both support
multiprocessor configurations.
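Latencies like those in Table 2 are commonly measured with a pointer-chasing loop in which every load depends on the previous one, so the average time per iteration approximates the access latency. The following is a minimal sketch under simplifying assumptions (a strided rather than randomized chain, so hardware prefetching may lower the apparent latency):

/* Minimal pointer-chasing sketch for memory latency. Each load
 * depends on the previous one, so the loop time is dominated by
 * access latency. Illustrative only: real tools randomize the
 * chain to defeat prefetching. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define BYTES (64 * 1024 * 1024)   /* well beyond the 1MB L2 */
#define STRIDE 128                 /* larger than a cache line */
#define ITERS 10000000L

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    size_t n = BYTES / sizeof(void *);
    size_t step = STRIDE / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));

    /* Link every STRIDE-th slot into a cycle. */
    for (size_t i = 0; i < n; i += step)
        ring[i] = &ring[(i + step) % n];

    void **p = ring;
    double t0 = now();
    for (long i = 0; i < ITERS; i++)
        p = (void **)*p;               /* serialized dependent loads */
    double t = now() - t0;

    printf("latency: %.1f ns (last=%p)\n", t / ITERS * 1e9, (void *)p);
    free(ring);
    return 0;
}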
The memory hierarchy of the XT3 compute node is
obvious when measured with the CacheBench tool [24].
Figure 3 shows that the system reaches a maximum of
approximately 9 GB/s when accessing vectors of data in
the L2 cache. When data is accessed from main memory,
the bandwidth drops to about 3 GB/s.
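A read sweep in the style of CacheBench repeatedly sums a vector while doubling its size; the bandwidth plateaus and drops expose the cache levels. A compact sketch, with sizes and repetition counts chosen purely for illustration:

/* Compact sketch of a CacheBench-style read-bandwidth sweep: sum a
 * vector repeatedly while doubling its size, so the transition from
 * L1 to L2 to main memory becomes visible. Harness is illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    const size_t max_bytes = 32 * 1024 * 1024;
    double *v = malloc(max_bytes);
    double checksum = 0.0;
    for (size_t i = 0; i < max_bytes / sizeof *v; i++) v[i] = 1.0;

    for (size_t bytes = 4096; bytes <= max_bytes; bytes *= 2) {
        size_t n = bytes / sizeof *v;
        long reps = (256L * 1024 * 1024) / bytes; /* ~256 MB per size */
        double sum = 0.0;
        double t0 = now();
        for (long r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += v[i];
        double t = now() - t0;
        checksum += sum;
        printf("%9zu bytes: %6.2f GB/s\n", bytes,
               (double)bytes * reps / t / 1e9);
    }
    printf("(checksum %g)\n", checksum); /* keep the sums from being elided */
    free(v);
    return 0;
}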
4.2 Scientific Operations
We use a collection of micro-benchmarks to
characterize the performance of the underlying hardware,
compilers, and software libraries for common operations
in computational science. The micro-benchmarks measure
computational performance, memory hierarchy
performance, and inter-processor communication. Figure 4
compares the double-precision floating point performance
of a matrix multiply (DGEMM) on a single processor
using the vendors’ scientific libraries. The XT3 Opteron
achieves 4 GFlops using the ACML version 2.6 library,
about 83% of peak.
Figure 3: CacheBench read results for a single XT3
compute node.
Figure 4: Performance of Matrix Multiply.
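The measurement behind Figure 4 amounts to timing a library DGEMM call and converting the operation count to a rate. A minimal sketch against the standard Fortran BLAS interface follows; on the XT3 the library would be ACML, and the matrix size shown here is illustrative:

/* Minimal sketch of the DGEMM micro-benchmark: time C = A*B through
 * the standard Fortran BLAS interface and report GFlops. Link with
 * any BLAS (e.g., ACML on the XT3). Matrix size is illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Fortran BLAS binding: column-major, all arguments by reference. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    const int n = 2000;
    const double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 0.5; }

    double t0 = now();
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    double t = now() - t0;

    /* A square matrix multiply performs 2*n^3 floating-point ops. */
    printf("DGEMM: %.2f GFlops (c[0]=%g)\n",
           2.0 * n * n * n / t / 1e9, c[0]);
    free(a); free(b); free(c);
    return 0;
}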
Fast Fourier Transforms are another operation
important to many scientific and signal processing
applications. Figure 5 plots 1-D FFT performance using
the vendor library (-lacml, -lscs, -lsci or -lessl), where
initialization time is not included. The XT3’s Opteron is
outperformed by the SGI Altix’s Itanium2 processor for
all vector lengths examined, but does better than the X1
for short vectors.
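The vendor FFT interfaces differ across these systems, so as a portable stand-in the same measurement can be sketched with FFTW (not one of the vendor libraries of Figure 5). As in our methodology, initialization (planning) is excluded from the timed region; the vector length is illustrative:

/* Sketch of the 1-D FFT timing, using FFTW as a portable stand-in
 * for the vendor libraries (-lacml, -lscs, -lsci, -lessl) of
 * Figure 5. As in the text, initialization (planning) is excluded
 * from the timed region. Link with -lfftw3. */
#include <stdio.h>
#include <sys/time.h>
#include <fftw3.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    const int n = 65536, reps = 100;
    fftw_complex *x = fftw_malloc(n * sizeof(fftw_complex));
    fftw_complex *y = fftw_malloc(n * sizeof(fftw_complex));
    /* Plan first (FFTW_MEASURE may overwrite x), then initialize. */
    fftw_plan plan = fftw_plan_dft_1d(n, x, y, FFTW_FORWARD, FFTW_MEASURE);
    for (int i = 0; i < n; i++) { x[i][0] = 1.0; x[i][1] = 0.0; }

    double t0 = now();
    for (int r = 0; r < reps; r++)
        fftw_execute(plan);             /* transform only; plan not timed */
    double t = (now() - t0) / reps;

    printf("FFT n=%d: %.3f ms per transform\n", n, t * 1e3);
    fftw_destroy_plan(plan);
    fftw_free(x); fftw_free(y);
    return 0;
}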
In general, our micro-benchmark results show the
promise of the Cray XT3 compute nodes for scientific
computing. Although the Cray X1’s high memory
bandwidth provided a clear benefit over the other systems
we considered, and the SGI Altix and IBM POWER5
systems gave better performance for several micro-
benchmarks, the XT3 showed solid performance, and in
some cases, it performed better than these other systems
for short vector lengths.
Figure 5: Performance of 1-D FFT using vendor
libraries.
Figure 6: Latency of MPI PingPong.
Figure 7: MPI PingPong benchmark bandwidth.
4.3 MPI
Because of the predominance of the message-passing
programming model in contemporary scientific
applications, examining the performance of message-
passing operations is critical to understanding a system’s
expected performance characteristics when running full
applications. Because most applications use the Message
Passing Interface (MPI) library [30], we evaluated the
latency and bandwidth of each vendor’s MPI
implementation. For our evaluation, we used the Pallas
MPI Benchmark suite, version 2.2 (now offered by Intel as
the Intel MPI Benchmarks).
Figure 6 and Figure 7 show the latency and
bandwidth, respectively, for the Pallas MPI PingPong
benchmark. We observe a latency of about 8 microseconds
for a 4 byte message, and a bandwidth of about 1.0 GB/s
for messages around 64KB.
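Conceptually, the PingPong measurement reduces to the following sketch for a single message size; the Pallas/IMB suite additionally sweeps message sizes and repetition counts:

/* Minimal MPI ping-pong sketch (cf. Figures 6 and 7): rank 0 and
 * rank 1 bounce a message; half the round-trip time is the one-way
 * latency. One message size shown; the suite sweeps sizes. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int reps = 1000, bytes = 4;   /* 4-byte message, as in the text */
    char *buf = calloc(bytes, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("one-way latency: %.2f us\n", t / reps / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}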
Figure 8: MPI Exchange benchmark latency.
Figure 9: Bandwidth of Pallas exchange operation.
Figure 8 and Figure 9 show the latency and bandwidth
(respectively) for the Pallas exchange benchmark at 4,096
processors. This test separates all tasks into two groups,
and then uses the MPI_Sendrecv operation to transfer
data between pairs of tasks, where the endpoints are in
separate groups. As opposed to the PingPong operation,
which transfers messages between only two tasks, the
exchange benchmark has all pairs transferring messages at
the same time. The average latency of these transfers is
higher, on the order of 20 microseconds for a 4 byte
message. The bandwidth is also less than that for the
PingPong test, but it reaches an average of nearly 700
MB/s for an individual transfer, in the context of 2,048
simultaneous transfers.
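The pattern as described above can be sketched as follows, pairing rank i with rank i + P/2 and letting all pairs transfer simultaneously (message size illustrative; assumes an even number of ranks):

/* Sketch of the exchange pattern described above: tasks split into
 * two halves, rank i paired with rank i + P/2, all pairs moving
 * data at once with MPI_Sendrecv. Run with an even rank count. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int reps = 1000, bytes = 4;
    char *sbuf = calloc(bytes, 1), *rbuf = calloc(bytes, 1);
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Sendrecv(sbuf, bytes, MPI_BYTE, peer, 0,
                     rbuf, bytes, MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("avg exchange time: %.2f us over %d pairs\n",
               t / reps * 1e6, half);
    MPI_Finalize();
    return 0;
}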
The latency for an MPI_Allreduce operation across
various payload sizes and processor counts is shown in
Figure 10. The MPI_Allreduce operation is particularly
important to several DOE simulation applications because
it may be used multiple times within each simulation
timestep. Further, its blocking semantics also require that
all tasks wait for its completion before continuing, so
latency for this operation is very important for application
scalability.
Figure 10: Latency for MPI_Allreduce across 4,096
processors.
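A measurement loop for Figure 10 reduces to repeatedly timing the collective; the payload and repetition counts below are illustrative:

/* Sketch of the MPI_Allreduce latency measurement behind Figure 10:
 * time repeated global sums at one payload size. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int reps = 1000, count = 1;   /* one double per task */
    double in = 1.0, out = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("MPI_Allreduce: %.2f us per call (out=%g)\n",
               t / reps * 1e6, out);
    MPI_Finalize();
    return 0;
}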
5 Applications
Insight into the performance characteristics of low-level
operations is important to understand overall system
performance, but because a system’s behavior when
running full applications is the most significant measure of
its performance, we also investigate the performance and
efficiency of full applications relevant to the DOE Office
of Science in the areas of global climate, fusion,
chemistry, and bioinformatics. The evaluation team
worked closely with principal investigators leading the
Scientific Discovery through Advanced Computing
(SciDAC) application teams to identify important
applications.
5.1 Parallel Ocean Program (POP)
The Parallel Ocean Program (POP) [19] is the ocean
component of CCSM [5] and is developed and maintained
at Los Alamos National Laboratory (LANL). The code is
based on a finite-difference formulation of the three-
dimensional flow equations on a shifted polar grid. In its
high-resolution configuration, 1/10-degree horizontal
resolution, the code resolves eddies for effective heat
transport and the locations of ocean currents.
For our evaluation, we used a POP benchmark
configuration called x1 that represents a relatively coarse
resolution similar to that currently used in coupled climate
models. The horizontal resolution is roughly one degree
(320x384) and uses a displaced-pole grid with the pole of
the grid shifted into Greenland and enhanced resolution in
the equatorial regions. The vertical coordinate uses 40
vertical levels with a smaller grid spacing near the surface
to better resolve the surface mixed layer. Because this
configuration does not resolve eddies, it requires the use of
computationally intensive subgrid parameterizations. This
configuration is set up to be identical to the production
configuration of the Community Climate System Model
with the exception that the coupling to full atmosphere, ice
and land models has been replaced by analytic surface
forcing.
Figure 11: Performance of POP.
POP performance is characterized by the performance
of two phases: baroclinic and barotropic. The baroclinic
phase is three dimensional with limited nearest-neighbor
communication and typically scales well on all platforms.
In contrast, performance of the barotropic phase is
dominated by the performance of a two-dimensional,
implicit solver whose performance is very sensitive to
network latency and typically scales poorly on all
platforms.
Figure 12: Performance of POP barotropic phase.
Figure 11 shows a platform comparison of POP
throughput for the x1 benchmark problem. On the Cray
X1E, we considered an MPI-only implementation and also
a variant that uses a Co-Array Fortran (CAF)
implementation of a performance-sensitive halo update
operation. All other results were for MPI-only versions of
POP. The XT3 performance is similar to that of the SGI
Altix up to 256 processors, and continues to scale out to
1,024 processors even for this small fixed-size problem.
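For illustration, a halo update of the kind referenced above can be sketched in MPI as a four-neighbor ghost-cell exchange on a 2-D decomposition. This is a minimal sketch, not POP's implementation: grid sizes are illustrative, and POP's update also handles its displaced-pole grid and multiple fields.

/* Minimal ghost-cell (halo) exchange on a 2-D decomposition with
 * one-deep halos, in the spirit of POP's halo update. */
#include <stdio.h>
#include <mpi.h>

#define NX 64            /* local interior size, illustrative */
#define NY 64
#define LDY (NY + 2)     /* +2 for one ghost cell on each side */

int main(int argc, char **argv) {
    static double u[NX + 2][LDY];   /* zero-initialized */
    int dims[2] = {0, 0}, periods[2] = {1, 1};  /* wraparound */
    int rank, size, north, south, east, west;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_shift(cart, 0, 1, &west, &east);
    MPI_Cart_shift(cart, 1, 1, &south, &north);

    /* Fixed-x slices are contiguous in this C layout; fixed-y slices
       are strided, so describe them with a vector type. */
    MPI_Datatype row;
    MPI_Type_vector(NX, 1, LDY, MPI_DOUBLE, &row);
    MPI_Type_commit(&row);

    /* East-west: contiguous slices of NY doubles. */
    MPI_Sendrecv(&u[1][1], NY, MPI_DOUBLE, west, 0,
                 &u[NX + 1][1], NY, MPI_DOUBLE, east, 0,
                 cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NX][1], NY, MPI_DOUBLE, east, 1,
                 &u[0][1], NY, MPI_DOUBLE, west, 1,
                 cart, MPI_STATUS_IGNORE);
    /* North-south: strided slices via the vector type. */
    MPI_Sendrecv(&u[1][1], 1, row, south, 2,
                 &u[1][NY + 1], 1, row, north, 2,
                 cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1][NY], 1, row, north, 3,
                 &u[1][0], 1, row, south, 3,
                 cart, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo update complete on %d tasks\n", size);
    MPI_Type_free(&row);
    MPI_Finalize();
    return 0;
}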
Figure 12 shows the performance of the barotropic
portion of POP. While lower latencies on the Cray X1E
and SGI Altix systems give them an advantage over the
XT3 for this phase, the XT3 showed good scalability in
the sense that the cost does not increase significantly out
to 1024 processors. Figure 13 shows the performance of
the baroclinic portion of POP. The Cray XT3 performance
was very similar to that of the SGI Altix, and showed
excellent scalability.
Figure 13: Performance of POP baroclinic phase.
5.2 GYRO
GYRO [9] is a code for the numerical simulation of
tokamak microturbulence, solving time-dependent,
nonlinear gyrokinetic-Maxwell equations with gyrokinetic
ions and electrons capable of treating finite
electromagnetic microturbulence. GYRO uses a five-
dimensional grid and propagates the system forward in
time using a fourth-order, explicit Eulerian algorithm.
GYRO has been ported to a variety of modern HPC
platforms including a number of commodity clusters.
Since code portability and flexibility are considered
crucial to this code’s development team, only a single
source tree is maintained and platform-specific
optimizations are restricted to a small number of low-level
operations such as FFTs. Ports to new architectures often
involve nothing more than the creation of a new makefile.
For our evaluation, we ran GYRO for the B3-GTC
benchmark problem. Interprocess communication for this
problem is dominated by “transposes” used to change the
domain decomposition within each timestep. The
transposes are implemented using simultaneous
MPI_Alltoall collective calls over subgroups of processes.
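A minimal sketch of this communication pattern follows, with the subgroup count and message sizes chosen purely for illustration rather than taken from GYRO's decomposition:

/* Sketch of simultaneous MPI_Alltoall calls over subgroups:
 * MPI_COMM_WORLD is split into subcommunicators and each performs
 * its own all-to-all at the same time. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int ngroups = 4, chunk = 1024; /* doubles per pairwise block */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Color consecutive ranks into the same subcommunicator. */
    int per = size / ngroups;
    if (per == 0) per = 1;               /* guard for tiny runs */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank / per, rank, &sub);

    int subsize;
    MPI_Comm_size(sub, &subsize);
    double *sendbuf = malloc((size_t)subsize * chunk * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)subsize * chunk * sizeof *recvbuf);
    for (int i = 0; i < subsize * chunk; i++) sendbuf[i] = rank;

    /* All subgroups transpose at the same time, as in the timestep. */
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, sub);

    if (rank == 0)
        printf("alltoall done in groups of %d tasks\n", subsize);
    free(sendbuf); free(recvbuf);
    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}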
Figure 14 shows platform comparisons of GYRO
throughput for the B3-GTC benchmark problem. Note that
there is a strong algorithmic preference for power-of-two
numbers of processors for large processor counts, arising
from significant redundant work when not using a power-
of-two number of processes. This impacts performance
differently on the different systems. The Altix is somewhat
superior to the XT3 out to 960 processors, but XT3
scalability is excellent, achieving the best overall
performance at 4,096 processors.
Figure 14: GYRO Performance for B3-GTC input.
5.3 S3D
S3D is a code used extensively to investigate first-of-a-
kind fundamental turbulence-chemistry interactions in
combustion topics ranging from premixed flames [10, 16],
auto-ignition [14], to nonpremixed flames [17, 22, 31]. It
is baesd on a high-order accurate, non-dissipative
numerical scheme. Time advancement is achieved through
a fourth-order explicit Runge-Kutta method, differencing
is achieved through high-order (eighth-order with tenth-
order filters) finite differences on a Cartesian, structured
grid, and Navier-Stokes Characteristic Boundary
Conditions (NSCBC) are used to prescribe the boundary
conditions. The equations are solved on a conventional
structured mesh.
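For illustration, an eighth-order central difference of the kind used for S3D's spatial derivatives looks as follows on a uniform 1-D grid. These are the standard central coefficients; S3D's production scheme adds tenth-order filters and boundary closures, which are omitted here.

/* Eighth-order central finite difference on a uniform grid, using
 * the standard interior coefficients. Boundary closures and the
 * tenth-order filters of the production scheme are omitted.
 * Compile with -lm. */
#include <stdio.h>
#include <math.h>

#define N 128

/* df[i] ~ f'(x_i) at interior points, eighth-order accurate. */
static void deriv8(const double *f, double *df, int n, double h) {
    const double c1 = 4.0 / 5.0, c2 = -1.0 / 5.0,
                 c3 = 4.0 / 105.0, c4 = -1.0 / 280.0;
    for (int i = 4; i < n - 4; i++)
        df[i] = (c1 * (f[i + 1] - f[i - 1]) +
                 c2 * (f[i + 2] - f[i - 2]) +
                 c3 * (f[i + 3] - f[i - 3]) +
                 c4 * (f[i + 4] - f[i - 4])) / h;
}

int main(void) {
    const double PI = 3.14159265358979323846;
    double f[N], df[N], h = 2.0 * PI / N;
    for (int i = 0; i < N; i++) f[i] = sin(i * h);
    deriv8(f, df, N, h);
    /* d/dx sin(x) = cos(x); check at an interior point. */
    printf("df[N/2] = %.12f, cos = %.12f\n", df[N / 2], cos((N / 2) * h));
    return 0;
}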
This computational approach is very appropriate for
direct numerical simulation of turbulent combustion. The
coupling of high-order finite difference methods with
explicit Runge-Kutta time integration makes very effective
use of the available resources, obtaining spectral-like
spatial resolution without excessive communication
overhead and allowing scalable parallelism.
For our evaluation, the problem configuration is a 3D
direct numerical simulation of a slot-burner bunsen flame
with detailed chemistry. This includes methane-air
chemistry with 17 species and 73 elementary reactions.
This simulation used 80 million grid points. The
simulation is part of a parametric study performed on
different Office of Science computing platforms: the IBM
SP at NERSC, the HP Itanium2 cluster at PNNL, and the
ORNL Cray X1E and XT3. Figure 15 shows that S3D
scales well across the various platforms and exhibited a
90% scaling efficiency on the Cray XT3.
Figure 15: S3D performance.
5.4 Molecular Dynamics Simulations
Molecular dynamics (MD) simulations enable the study
of complex, dynamic processes that occur in biological
systems [20]. The MD related methods are now routinely
used to investigate the structure, dynamics, functions, and
thermodynamics of biological molecules and their
complexes. The types of biological activity that have
been investigated using MD simulations include protein
folding, enzyme catalysis, conformational changes
associated with biomolecular function, and molecular
recognition of proteins, DNA, and biological membrane
complexes. Biological molecules exhibit a wide range of
time and length scales over which specific processes
occur, hence the computational complexity of an MD
simulation depends greatly on the time and length scales
considered. With a solvation model, typical system sizes
of interest range from 20,000 atoms to more than 1 million
atoms; if the solvation is implicit, sizes range from a few
thousand atoms to about 100,000. The simulated time
period can range from picoseconds to a few microseconds
or longer.
Several commercial and open source software
frameworks for MD calculations are in use by a large
community of biologists, including AMBER [26] and
LAMMPS [28]. These packages use slightly different
forms of potential function and also their own force-field
calculations. Some of them are able to use force-fields
from other packages as well. AMBER provides a wide
range of MD algorithms. The version of LAMMPS used in
our evaluation does not use the energy minimization
technique, which is commonly used in biological
simulations.
AMBER. AMBER consists of about 50 programs that
perform a diverse set of calculations for system
preparation, energy minimization (EM), molecular
dynamics (MD), and analysis of results. AMBER's main
module for EM and MD is known as sander (for simulated
annealing with NMR-derived energy restraints). We used
sander to investigate the performance characteristics of
EM and MD techniques using the Particle Mesh Ewald
(PME) and Generalized Born (GB) methods. We
performed a detailed analysis of PME and GB algorithms
on massively parallel systems (including the XT3) in other
work [2].
Figure 16: AMBER simulation throughput.
The bio-molecular systems used for our experiments
were designed to represent the variety of complexes
routinely investigated by computational biologists. In
particular, we considered the RuBisCO enzyme based on
the crystal structure 1RCX, using the Generalized Born
method for implicit solvent. The model consists of
73,920 atoms. In Figure 16, we report the performance of
the code as simulation throughput, expressed in simulated
picoseconds per day of wall-clock time (psec/day). The
performance on the Cray XT3 is very good for large-scale
experiments, with more than twice the throughput of the
other architectures we investigated.
Figure 17: LAMMPS parallel efficiency with
approximately 290K atoms.
LAMMPS. LAMMPS (Large-scale Atomic/Molecular
Massively Parallel Simulator) [28] is a classical MD code.
LAMMPS models an ensemble of particles in a liquid,
solid or gaseous state and can be used to model atomic,
polymeric, biological, metallic or granular systems. The
version we used for our experiments is written in C++ and
MPI.
For our evaluation, we considered the RAQ system, a
model of the enzyme RuBisCO. This model consists of
290,220 atoms with explicit treatment of solvent. We
observed very good performance for this problem on the
Cray XT3 (see Figure 17), with over 60% efficiency on up
to 1,024 processors and over 40% efficiency on the
4,096-processor run.
6 Conclusions and Plans
Oak Ridge National Laboratory has received and
installed a 5,294 processor Cray XT3. In this paper we
describe our performance evaluation of the system as it
was being deployed, including micro-benchmark, kernel,
and application benchmark results. We focused on
applications from important Department of Energy
applications areas including climate and fusion. In
experiments with up to 4,096 processors, we observed that
the Cray XT3 shows tremendous potential for supporting
the Department of Energy application workload, with
good scalar processor performance and high interconnect
bandwidth when compared to other microprocessor-based
systems.
Acknowledgements
This research was sponsored by the Office of
Mathematical, Information, and Computational Sciences,
Office of Science, U.S. Department of Energy under
Contract No. DE-AC05-00OR22725 with UT-Battelle,
LLC. Accordingly, the U.S. Government retains a non-
exclusive, royalty-free license to publish or reproduce the
published form of this contribution, or allow others to do
so, for U.S. Government purposes.
Also, we would like to thank Jeff Beckleheimer, John
Levesque, Nathan Wichmann, and Jim Schwarzmeier of
Cray, and Don Maxwell of ORNL for all their assistance
in this endeavor.
We gratefully acknowledge Jeff Kuehn of ORNL for
collection of performance data on the BG/L system and
Hongzhang Shan of LBL for collection of performance
data on the IBM SP. We thank the National Energy
Research Scientific Computing Center for access to the
IBM SP, Argonne National Laboratory for access to the
IBM BG/L, the NASA Advanced Supercomputing
Division for access to their SGI Altix, and the ORNL
Center for Computational Sciences (CCS) for access to the
Cray X1, Cray X1E, Cray XD1, Cray XT3, IBM p690
cluster, and SGI Altix. The CCS is supported by the Office
of Science of the U.S. Department of Energy under
Contract No. DE-AC05-00OR22725.
References
[1] P.A. Agarwal, R.A. Alexander et al., “Cray X1
Evaluation Status Report,” ORNL, Oak Ridge, TN,
Technical Report ORNL/TM-2004/13, 2004,
http://www.csm.ornl.gov/evaluation/PHOENIX/PDF/CRAYEvaluationTM2004-15.pdf.
[2] S.R. Alam, P. Agarwal et al., “Performance
characterization of bio-molecular simulations using
molecular dynamics,” Proc. ACM SIGPLAN
Symposium on Principles and Practice of Parallel
Programming (PPOPP), 2006.
[3] AMD, “Software Optimization Guide for AMD
Athlon™ 64 and AMD Opteron™ Processors,”
Technical Manual 25112, 2004.
[4] D. Anderson, J. Trodden, and MindShare Inc.,
HyperTransport system architecture. Reading, MA:
Addison-Wesley, 2003.
[5] M.B. Blackmon, B. Boville et al., “The Community
Climate System Model,” BAMS, 82(11):2357-76, 2001.
[6] R. Brightwell, W. Camp et al., “Architectural
Specification for Massively Parallel Computers-An
Experience and Measurement-Based Approach,”
Concurrency and Computation: Practice and
Experience, 17(10):1271-316, 2005.
[7] R. Brightwell, L.A. Fisk et al., “Massively Parallel
Computing Using Commodity Components,” Parallel
Computing, 26(2-3):243-66, 2000.
[8] R. Brightwell, R. Riesen et al., “Portals 3.0: Protocol
Building Blocks for Low Overhead Communication,”
Proc. Workshop on Communication Architecture for
Clusters (in conjunction with International Parallel &
Distributed Processing Symposium), 2002, pp. 164-73.
[9] J. Candy and R. Waltz, “An Eulerian gyrokinetic-
Maxwell solver,” J. Comput. Phys., 186(545), 2003.
[10] J.H. Chen and H.G. Im, “Stretch effects on the Burning
Velocity of turbulent premixed Hydrogen-Air Flames,”
Proc. Comb. Inst, 2000, pp. 211-8.
[11] Cray Incorporated, “Cray XT3 Programming
Environment User's Guide,” Reference Manual S-
2396-10, 2005.
[12] T.H. Dunigan, Jr., J.S. Vetter et al., “Performance
Evaluation of the Cray X1 Distributed Shared Memory
Architecture,” IEEE Micro, 25(1):30-40, 2005.
[13] T.H. Dunigan, Jr., J.S. Vetter, and P.H. Worley,
“Performance Evaluation of the SGI Altix 3700,” Proc.
International Conf. Parallel Processing (ICPP), 2005.
[14] T. Echekki and J.H. Chen, “Direct numerical
simulation of autoignition in non-homogeneous
hydrogen-air mixtures,” Combust. Flame, 134:169-91,
2003.
[15] M.R. Fahey, S.R. Alam et al., “Early Evaluation of the
Cray XD1,” Proc. Cray User Group Meeting, 2005, pp.
12.
[16] E.R. Hawkes and J.H. Chen, “Direct numerical
simulation of hydrogen-enriched lean premixed
methane-air flames,” Combust. Flame, 138(3):242-58,
2004.
[17] E.R. Hawkes, R. Sankaran et al., “Direct numerical
simulation of turbulent combustion: fundamental
insights towards predictive models,” Proc. SciDAC PI
Meeting, 2005.
[18] High-End Computing Revitalization Task Force
(HECRTF), “Federal Plan for High-End Computing,”
Executive Office of the President, Office of Science
and Technology Policy, Washington, DC 2004.
[19] P.W. Jones, P.H. Worley et al., “Practical performance
portability in the Parallel Ocean Program (POP),”
Concurrency and Computation: Practice and Experience
(in press), 2004.
[20] M. Karplus and G.A. Petsko, “Molecular dynamics
simulations in biology,” Nature, 347, 1990.
[21] A.B. Maccabe, K.S. McCurley et al., “SUNMOS for
the Intel Paragon: A Brief User’s Guide,” Proc. Intel
Supercomputer Users’ Group, 1994, pp. 245-51.
[22] S. Mahalingam, J.H. Chen, and L. Vervisch, “Finite-
rate chemistry and transient effects in direct
numerical simulations of turbulent non-premixed
flames,” Combust. Flame, 102(3):285-97, 1995.
[23] T.G. Mattson, D. Scott, and S.R. Wheat, “A TeraFLOP
Supercomputer in 1996: The ASCI TFLOP System,”
Proc. 10th International Parallel Processing
Symposium (IPPS 96), 1996, pp. 84-93.
[24] P.J. Mucci, K. London, and J. Thurman, “The
CacheBench Report,” University of Tennessee,
Knoxville, TN 1998.
[25] Oak Ridge National Laboratory, Early Evaluation
Website, http://www.csm.ornl.gov/evaluation, 2005.
[26] D.A. Pearlman, D.A. Case et al., “AMBER, a package
of computer programs for applying molecular
mechanics, normal mode analysis, molecular dynamics
and free energy calculations to simulate the structural
and energetic properties of molecules,” Computer
Physics Communication, 91, 1995.
[27] K. Pedretti, R. Brightwell, and J. Williams, “Cplant
Runtime System Support for Multi-Processor and
Heterogeneous Compute Nodes,” Proc. IEEE
International Conference on Cluster Computing
(CLUSTER 2002), 2002, pp. 207-14.
[28] S.J. Plimpton, “Fast Parallel Algorithms for Short-
Range Molecular Dynamics,” in Journal of
Computational Physics, vol. 117, 1995
[29] S.L. Scott, “Synchronization and Communication in
the T3E Multiprocessor,” Proc. Architectural Support
for Programming Languages and Operating Systems
(ASPLOS), 1996, pp. 26-36.
[30] M. Snir, S. Otto et al., Eds., MPI--the complete
reference, 2nd ed. Cambridge, MA: MIT Press, 1998.
[31] J.C. Sutherland, P.J. Smith, and J.H. Chen,
“Quantification of Differential Diffusion in
Nonpremixed Systems,” Combust. Theory and
Modelling (to appear), 2005.
[32] US Department of Energy Office of Science, “A
Science-Based Case for Large-Scale Simulation,” US
Department of Energy Office of Science 2003,
http://www.pnl.gov/scales.
[33] S.R. Wheat, A.B. Maccabe et al., “PUMA: An
Operating System for Massively Parallel Systems,”
Journal of Scientific Programming (special issue on
operating system support for massively parallel
systems), 3(4):275-88, 1994.